

Executive Summary 


This deliverable provides an overview of the tools and applications developed in the MOSAICrOWN 
project to satisfy the use cases by the industry partners (EISI, SAP SE, MC). The tools leverage 
the technologies developed in Work Packages 3-5. The use cases and their respective tools are 
summarized in the following. 


Use Case 1: Tools for ICV platforms (EISI). Use Case 1 considers data protection in Intelligent 
Connected Vehicle (ICV) platforms to facilitate data exchange between a fleet manager, 
Electric Vehicles (EV), and their charging infrastructure to gather insights and, e.g., inform 
infrastructure planning and allocation for EV charging. 


The tools for Use Case 1 provide a platform for storing and accessing data securely and are 
summarized next. The automotive tools facilitate data ingestion from the electric vehicle 
to the data market and cover the non-functional requirements of the data economy. The 
tools include an application which can be integrated into the electric vehicle. The web tools 
facilitate the data access and authorization mechanisms depending on the access rights of 
each user via a web-based user interface. The policy engine parses the MOSAICrOWN 
policy and permits or denies requests enabling access control in the context of the ICV 
platform. The encrypted filesystem provides encryption for the secure storage of
privacy-related personal data. It includes FreyaFS, a tool developed by academic partners.


Use Case 2: Tools for financial data markets (MC). Use Case 2 is concerned with transaction- 
level financial data and data wrapping techniques for the final purpose of analytics and the 
extraction of business insights from the data itself. 


The developed application allows users, once connected to the platform via a web browser, 
to upload the dataset to be anonymized. The platform proposes wrapping techniques for
each field (according to the policy regulation selected); once these are confirmed or overwritten by
the user, the platform anonymizes the dataset. In more detail, the identity component
authenticates users to use the offered services. The semantics/wrapping component performs
a semantic analysis to detect, e.g., data types and proposes protection techniques. Users 
can then define and apply the wrapping techniques for each field. The analytics component 
gives users access to a dedicated dashboard to generate insights based on analytics over the 
anonymized data. The analysis provides aggregated results, so no granular information is 
presented. 


Use Case 3: Privacy-preserving tools for a cloud-based data market (SAP SE). Use Case 3 op- 
erates in a cloud-based data market where businesses acting as data providers aim to perform 
analytics over their data while protecting the underlying information. For example, a
producer and a retailer want to learn supply chain insights and improve the allocation of
marketing budgets from operational data (indicators for industry benchmarking, e.g., return rates,
process times) and customer experience data (ratings, purchase history) while ensuring their
sensitive business and customer data is protected.


To provide privacy-preserving analytics in the context of the use case as well as appropri- 
ate guided parameterization based on sanitization techniques satisfying differential privacy, 
three main tools are provided: DPtool, MIA, and DPsc. DPtool provides a REST API as well
as a graphical web interface for the selection, parameterization and application of various
differential privacy mechanisms (e.g., perturbation via Laplace or Geometric noise) on data in
CSV format. MIA helps data scientists with the parameterization of differential privacy for
machine learning applications, as well as with its evaluation in terms of membership inference
risks, via a graphical web interface. DPsc additionally leverages cryptographic techniques to
share only anonymized analytics in the form of rank-based statistics (i.e., percentiles) securely
computed over the distributed data of multiple parties.


1. Use Case 1 (EISI): Tools for ICV platforms


This section illustrates the tools for Use Case 1 (UC1) used in the development of prototype
applications. The task that this deliverable is based on is T2.3 “Testbed platform and deployment
to Use Cases”, with specific reference to UC1. T2.3 leverages the technologies developed in
WPs 3-5. The tools presented in this deliverable provide an automotive interface to the platform,
a web-based front end to the platform, and an encrypted file system used for the secure storage of
personal data.


1.1 Introduction 


The emphasis of UC1, data protection in ICV (Internet Connected Vehicle) platforms, is on data
ingestion, storage and analytics, taking into account data governance policy, data wrapping and
sanitization.


Figure 1.1: Overview of UC1's data flow


Figure 1.1 gives an overview of UC1's data flow, showing the exchange of data between the
main components of the platform. Each stage of the data flow is described below:


1. Ingestion of ICV data into the MOSAICrOWN platform, which is obtained from the vehicles 
being monitored. The data is in JSON format, and is a combination of dynamic metrics and 
the metadata that describe them. 


2. Ingestion of EV charging station data into the MOSAICrOWN platform. The data is in CSV
format. 


3. Initial processing of data into the MOSAICrOWN platform. The metadata is extracted from
the data and converted into a canonical JSON-LD (JSON Linked Data) format. As part of
the process, data is cross-referenced to the metadata.


4. The data and metadata are stored in the MOSAICrOWN platform. The data is stored on a 
high capacity file system and the metadata is stored in a knowledge base, which is imple- 
mented as an RDF triple store. 


5. Analytics is performed by the MOSAICrOWN platform. 


6. The data and analytics are made available to the user. The type of data and any redactions 
are dependent on the user’s role, and consequent access rights. 
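
Stage 3 of the data flow above can be pictured with a minimal Python sketch. The record layout, the context URL, the `urn:vehicle:` identifier scheme and the storage path are illustrative assumptions; the deliverable does not specify the platform's actual schema.

```python
import json

# Hypothetical raw ICV record: static descriptors plus dynamic metrics.
RAW_RECORD = {
    "vin": "VIN-0001",
    "modelYear": "2010",
    "metrics": {"speed": 52.0, "batteryLevel": 0.81},
}

# Assumed JSON-LD context, used for illustration only.
CONTEXT = "http://example.org/icv/context.jsonld"


def split_record(record, data_path):
    """Separate metadata from dynamic metrics and cross-reference the two."""
    metadata = {
        "@context": CONTEXT,
        "@id": "urn:vehicle:" + record["vin"],
        "modelYear": record["modelYear"],
        # Cross-reference: the metadata records where the data is stored.
        "dataLocation": data_path,
    }
    # The data keeps a pointer back to the metadata entry it belongs to.
    data = {"vehicle": metadata["@id"], **record["metrics"]}
    return metadata, data


metadata, data = split_record(RAW_RECORD, "/datalake/icv/VIN-0001.json")
print(json.dumps(metadata, indent=2))
print(json.dumps(data, indent=2))
```

In this sketch the metadata document would go to the knowledge base and the data document to the file system, mirroring stage 4.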


The remainder of this chapter is organized as follows. Section 1.2 gives some background
to UC1. Section 1.3 discusses the automotive interfaces evaluated for the platform, and how the
selected interface was implemented. Section 1.4 explains the Web UI implemented by the
platform, and the ways different users can interact with it. Section 1.5 covers the implementation of
the MOSAICrOWN policy engine and the enhancements implemented for UC1. Section 1.6
discusses the encrypted file system for the secure storage of privacy-related personal data. Section 1.7
summarizes the outcomes of UC1.


1.2 Use Case Background 


UC1 demonstrates how an automotive data market can be implemented using the MOSAICrOWN
platform. The use case describes how automotive user data from electric vehicles can be safely
sanitized/wrapped and monetized through the data market. The driver/data owner may participate
in the data market by selecting policies to be applied to their data in exchange for discounted
electric vehicle charging rates. The Web UI of the platform showcases how the privacy of data and
the access rights of users can be enforced. Several critical stages in the data flow were considered
when implementing the use case, including:


• Raw data ingestion
• Intermediate storage of data prior to processing
• Dynamic application of privacy-preserving techniques
• Access control and enforcement through a data market service.


One innovative feature of UC1 is the use of containers to encapsulate significant stages in
the data flow. Containerizing these stages allows technologies to be swapped in and out of the
solution without significantly affecting other stages, thus reducing dependencies and increasing
flexibility in technology selection. Additionally, containerization greatly simplifies deployment of
the solution in different environments. UC1 used Docker to implement the containerization, but
any suitable containerization technology could have been used. The use case also demonstrated
how the data storage could be externalized from the containers and abstracted behind an open API
such as S3.


Ingestion and Filtering 


The initial ingestion of vehicle, charging station and policy data is handled in Apache NiFi, an
open source framework for data processing and distribution. NiFi is responsible for many of the
key intermediary processes applied to the raw data, including the data/metadata extraction, the
conversion of metadata and policies to RDF, and the placement of data/metadata in the relevant
location for access from the data market.

NiFi also incorporates the data market filter functionality, whereby an appropriate
privacy-preserving mechanism is applied before the data reaches the data market. This is
implemented as a policy-dependent routing mechanism. Depending on the current metadata policy at
ingestion time, data can be routed to a service to provide wrapping or instantiation.
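
The idea of policy-dependent routing can be sketched in a few lines of Python. This is not the NiFi flow used in UC1: the handler functions, the policy labels used as routing keys, and the fail-closed default are all assumptions made for illustration.

```python
from typing import Callable, Dict

Record = Dict[str, object]


def wrap(record: Record) -> Record:
    """Stand-in for a wrapping service; a real one might encrypt the payload."""
    return {"route": "wrapping", "payload": record}


def passthrough(record: Record) -> Record:
    """Stand-in for direct delivery to the data market."""
    return {"route": "data-market", "payload": record}


# Hypothetical mapping from the policy in force to a downstream service.
ROUTES: Dict[str, Callable[[Record], Record]] = {
    "incognito": wrap,
    "public": passthrough,
}


def route(record: Record, policy: str) -> Record:
    """Pick the downstream service from the policy current at ingestion time."""
    handler = ROUTES.get(policy)
    if handler is None:
        # Assumed design choice: unknown policies take the protective route.
        return wrap(record)
    return handler(record)


print(route({"vin": "VIN-0001"}, "public")["route"])
print(route({"vin": "VIN-0001"}, "incognito")["route"])
```

Keeping the routing table separate from the handlers means a new protection service can be added without touching the dispatch logic.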


Data Lake/Storage 


The data lake and data market components are built on open-source solutions. Apache Hadoop is 
responsible for the storage of raw and treated data while Apache Jena is used to store metadata in 
RDF format. Data can be accessed through SPARQL queries to Jena. Access control and policy 
enforcement are applied before retrieving and returning data from HDFS through a WebHDFS 
interface. 

In the following sections, the tooling developed and integrated with these core platform
components is detailed.


1.3 Automotive Tools 


The tools described in this section address a number of requirements presented in Deliverable 
D2.1 “Requirements from the Use Cases”. Specifically, the automotive tools facilitate data in- 
gestion from the Electric Vehicle (EV) to the data market as well as enable the data economy 
non-functional requirements. Briefly, the tools take the form of an application to be integrated 
into the EV as well as modifications to the MOSAICrOWN governance framework. In the fol- 
lowing, we present each component of the tools and we describe a scenario which facilitates the 
monetization of private data for the benefit of the data owner. 


Electric Vehicle Requirements 


In UC1, the connected vehicle fleet manager and the EV charging infrastructure provider want
to exchange data such that they can derive mutually beneficial insights into the status of the EV
charging infrastructure. To that end, the data of the EV and the behavioral data of the driver
are required. The most relevant requirements are that the ingestion mechanism should support
real-time stream data handling (REQ-UC1-DI2) as well as ingestion from multiple concurrent sources
(REQ-UC1-DI0) and that the platform should support a licensed model (REQ-UC1-DE1).

To facilitate the ingestion of data from the EV, we carried out an analysis of the available 
options to integrate tools into the EV, developed an application to integrate with the EV, modified 
the MOSAICrOWN data governance framework to ingest that data, and finally, augmented our 
design such that monetization opportunities are demonstrated. 


In-vehicle Infotainment software 


To select the tools used by UC1, we first performed a review of the existing automotive tools for
driver UIs, called In-vehicle Infotainment (IVI). We evaluated the following IVI systems [CM16,
Koo21], on which we comment next.


Considered IVI systems 


Apple CarPlay: it is an Apple standard that enables a car radio or head unit to be a display and a
controller for an iOS device.


Android Auto: it is a mobile app developed by Google to mirror features of an Android device,
such as a smartphone, on a car’s dashboard information and entertainment head unit.


GENIVI: it is an industry alliance involving OEMs, such as Bosch and Hyundai, working on an 
open-source infotainment platform for vehicles. 


Tizen IVI: it is an open-source project, part of the Linux Foundation and carried by Intel,
providing a free environment based on HTML5.


QNX Car platform: it allows the development of customizable apps for radio, weather and
location-based systems, and web applications based on HTML5, CSS3, and JavaScript.


Windows Embedded Automotive: it allows the leveraging of Silverlight for the creation of ad- 
vanced HMI interfaces. 


Automotive Grade Linux: it is an open-source stack for in-vehicle infotainment based on Tizen
IVI (of which it is a bootable distribution).


Evaluation of IVI systems


We carried out an evaluation of the IVI systems introduced in the previous section with a goal of 
identifying the most suitable IVI system to use as a prototype tool for use within an EV. This tool 
needed to satisfy the requirements presented in Section 1.3 and should also be user friendly.
Our evaluation is as follows: 


Apple CarPlay: Closed ecosystem, with difficulties in running the development environment on
Microsoft Windows based computers.

Android Auto: Depends on proprietary tools, requires Java or Kotlin knowledge.

GENIVI: Steep learning curve, experienced difficulties in installing the development environment.

Tizen IVI: Steep learning curve, experienced difficulties in installing the development environment.

QNX Car platform: Licensed platform, well-regarded Real-Time Operating System (RTOS).

Windows Embedded Automotive: No longer supported.

Automotive Grade Linux: Difficult to install, development of applications was too cumbersome.


Based on our evaluation of the available IVI systems, Android Auto was selected to develop the
MOSAICrOWN mobile application. 


1.3.1 Android Auto 


Android Auto is a mobile application developed by Google to replicate the functionality of an
Android device, such as a smartphone, on a car’s dashboard information and entertainment head
unit. It provides a driver-optimized app experience for users who have an Android phone and the
Android Auto app, but whose vehicle does not use Android Automotive OS. As of April
2021, Android Auto is available in 42 countries. An example of an installation of Android Auto
in a Polestar 2 EV is shown in Figure 1.2.


Figure 1.2: Android Auto installation in Polestar 2 Electric Vehicle [Pol]


Android Software Development Kit (SDK) 


Applications on Android platforms are typically developed in the Java programming language using
the Android SDK as well as other development environments and the Android Debug Bridge
(ADB). The Android SDK consists of a comprehensive set of development tools such as:


• debugger
• libraries
• handset emulator
• documentation
• sample code
• tutorials


Android applications are packaged in .apk format and saved under the /data/app folder on the
Android OS (this folder can only be accessed by the root user for security purposes). An APK
package contains .dex files (a set of compiled byte code files called Dalvik executables), resource
files, etc. The Android Debug Bridge (ADB) is a toolkit included in the Android SDK package.
The ADB comprises both client and server-side programs that communicate with one another
and are typically operated through the command-line interface. However, multiple graphical user
interfaces exist to control ADB.
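
Because an APK is a ZIP-based archive, its contents can be inspected with standard tooling. The short Python sketch below lists the Dalvik executables in a package; the archive is a dummy built in memory for the demonstration, and its entry names and contents are placeholders rather than a real application.

```python
import io
import zipfile


def list_dex_files(apk_bytes: bytes) -> list:
    """Return the names of the .dex entries inside an APK archive."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return sorted(name for name in apk.namelist() if name.endswith(".dex"))


def make_demo_apk() -> bytes:
    """Build a tiny in-memory archive with APK-like entries (dummy contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"<manifest/>")
        apk.writestr("classes.dex", b"placeholder bytecode")
        apk.writestr("res/layout/main.xml", b"<LinearLayout/>")
    return buffer.getvalue()


print(list_dex_files(make_demo_apk()))
```

For a package on disk, the same function can be fed the bytes read from the .apk file.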


Android Auto emulator 


The Desktop Head Unit (DHU) enables the development machine to emulate an Android Auto
head unit, so that a developer can run and test Android Auto apps. The DHU runs on Windows,
macOS, and Linux systems.


1.3.2 MOSAICrOWN Governance Framework Modifications


We linked the policy language introduced in Deliverable D3.3 “First Version of Policy
Specification Language and Model” with data being ingested from an EV. For UC1, we decided to allow
the EV driver to decide which policy is applied to their data. To simplify the interaction for the
EV driver, we predefined three levels of policies. These are: a) a policy with the most private
settings described, b) a policy with “moderate” privacy settings stipulated, and c) a policy with
the most permissive privacy settings described. For illustration, we present the most permissive
privacy policy in Listing 1.1. The most permissive privacy policy is the policy where all the data is
accessible without any restrictions to the user, regardless of their assigned role on the platform. In
Listing 1.1 we can see the permissions given to the fleet manager.


    {
      "@context": [
        "http://www.w3.org/ns/odrl.jsonld",
        "http://localhost:8000/ns/mosaicrown/namespace.jsonld"
      ],
      "@type": "Set",
      "uid": "http://dellemc.com/policy/MostPermissivePrivacyPolicy",
      "permission": [{
        "uid": "http://dellemc.com/policy/MostPermissivePrivacyPolicies_perm",
        "assignee": "http://dellemc.com/user/fleetmanager",
        "target": [
          "http://dellemc.com/icv/licensePlate",
          "http://dellemc.com/icv/vin",
          "http://dellemc.com/icv/name",
          "http://dellemc.com/icv/category",
          "http://dellemc.com/icv/type",
          "http://dellemc.com/icv/modelName",
          "http://dellemc.com/icv/modelYear",
          "http://dellemc.com/icv/colourName",
          "http://dellemc.com/icv/numberOfDoors",
          "http://dellemc.com/icv/numberOfSeats",
          "http://dellemc.com/icv/gearbox",
          "http://dellemc.com/icv/displayUnit",
          "http://dellemc.com/icv/driverSeatLocation",
          "http://dellemc.com/icv/charging/estimatedRange",
          "http://dellemc.com/icv/charging/batteryLevel",
          "http://dellemc.com/icv/diagnostic/mileage",
          "http://dellemc.com/icv/diagnostic/batteryVoltage",
          "http://dellemc.com/icv/diagnostic/speed",
          "http://dellemc.com/icv/ignition/status",
          "http://dellemc.com/icv/usage/averageFuelConsumption",
          "http://dellemc.com/icv/usage/averageWeeklyDistance",
          "http://dellemc.com/icv/usage/lastTripEnergyConsumption",
          "http://dellemc.com/icv/usage/lastTripFuelConsumption",
          "http://dellemc.com/icv/naviDestination/coordinates/latitude",
          "http://dellemc.com/icv/naviDestination/coordinates/longitude",
          "http://dellemc.com/icv/naviDestination/destinationName",
          "http://dellemc.com/icv/naviDestination/vehicleLocation/latitude",
          "http://dellemc.com/icv/naviDestination/vehicleLocation/longitude",
          "http://dellemc.com/icv/naviDestination/vehicleLocation/heading",
          "http://dellemc.com/icv/naviDestination/coordinates/longitude"
        ],
        "action": ["odrl:read", "odrl:use", "odrl:write", "odrl:sell", "odrl:sellReport"],
        "purpose": ["statistical", "marketing"]
      }]
    }


Listing 1.1: An example of a most permissive privacy policy expressed in JSON-LD 


1.3.3 Monetization of Private Data via Automotive Tools 


The Android application developed for UC1 allows the driver to select which policy they would
like applied to their PII based on the use the data would be put to, e.g., statistical. Figure 1.3 shows
an instance of the application prior to the driver charging their vehicle. The ‘Incognito’ option is
the most private policy setting, ‘Confidential’ is the moderate privacy setting, and ‘Public’ is the most
permissive policy setting. Figure 1.4 shows the application when the driver presents the vehicle
for charge. The driver is presented with options as to which policy they want applied to their data
from that time onwards in exchange for reduced electricity cost for charging their vehicle. This
policy can be applied for either a set period of time or until the next charge.
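
The two validity options described above (a set period, or until the next charge) can be modelled with a small Python sketch. The class name, the lowercase level labels and the `charged_since` flag are hypothetical and are not taken from the Android application.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# The three predefined levels, named after the options shown to the driver.
POLICY_LEVELS = ("incognito", "confidential", "public")


@dataclass
class PolicySelection:
    level: str
    selected_at: datetime
    valid_for: Optional[timedelta] = None  # None means "until the next charge"

    def __post_init__(self):
        if self.level not in POLICY_LEVELS:
            raise ValueError(f"unknown policy level: {self.level}")

    def is_active(self, now: datetime, charged_since: bool) -> bool:
        """Decide whether the selection still applies at time `now`."""
        if self.valid_for is None:
            # Open-ended selection: ends when the vehicle is charged again.
            return not charged_since
        return now < self.selected_at + self.valid_for


six_months = PolicySelection("confidential", datetime(2021, 1, 1), timedelta(days=180))
print(six_months.is_active(datetime(2021, 3, 1), charged_since=True))
print(six_months.is_active(datetime(2021, 12, 1), charged_since=False))
```

A selection with a fixed period ignores intermediate charges, whereas an open-ended one lapses at the next charge.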


1.4 Web UI 


This section explains the Web UI, which implements some of the requirements presented in
Deliverable D2.1 “Requirements from the Use Cases”. The main functionality of the Web UI is to
enable data access by different users (MOSAICrOWN Cloud Provider, Car Driver, Fleet Owner
and EV Charging Infrastructure Provider) according to their access rights specified in the
MOSAICrOWN policies. In more detail, the Web UI addresses the requirements for access control to
the data with specific levels of granularity based on the policy constraints (REQ-UC1-AC1),
restricting data sets and/or users to certain operations (e.g., only allowing sharing; REQ-UC1-AC4),
and allowing data sharing with specific or multiple data consumers (REQ-UC1-AC5, REQ-UC1-AC6).
Furthermore, data should be accessible to all authorized consumers of the data at the same time,
regardless of preferred format (REQ-UC1-DM2), and data integrity should be maintained separately
in isolation (REQ-UC1-DM3).


1.4.1 Pug - A Template Engine 


Figure 1.3: Android Auto interface at the start of the journey. The driver has the option to select
from any of the three policies


The Web UI is implemented using Node.js and the Pug template engine. Node.js is an
open-source, cross-platform, back-end JavaScript runtime environment that runs on the V8 engine and
executes server-side JavaScript code [Fou]. Pug is a template engine for Node.js, which is used
to render HTML pages. It is implemented in JavaScript and released under the MIT license. The
Pug template engine converts the Pug code to HTML code at compile time. The benefit of using a
templating engine is that it allows for the reuse of HTML page elements, while defining dynamic
elements based on the data [sit]. An example of Pug code is shown below.

    doctype html
    html(lang="en")
      head
        title Pug
        script(type='text/javascript').
          if (foo) bar(1 + 5)
        script(src='/javascripts/jquery.js')
      body
        h1 Pug - node template engine
        #container.col
          p Pug is a terse and simple templating language.

The Pug template engine converts the above Pug code into the following HTML code.


    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>Pug</title>
        <script type="text/javascript">if (foo) bar(1 + 5)</script>
        <script src="/javascripts/jquery.js"></script>
      </head>
      <body>
        <h1>Pug - node template engine</h1>
        <div id="container" class="col">
          <p>Pug is a terse and simple templating language.</p>
        </div>
      </body>
    </html>


Figure 1.4: Android Auto interface after selecting policy for electricity cost per kWh. The driver
has the opportunity to pay less for electric charge depending on how much PII they allow
to be collected


1.4.2 Web UI Structure 


The Web UI is structured to have different views depending on which user is logged in. The users
of the Web UI are the MOSAICrOWN Cloud Provider, Car Driver, Fleet Owner and EV Charging
Infrastructure Provider. The data accessible to each user varies according to their access rights, and
to accommodate that, the Web UI presents different views. The following sections explain the
different views and data that are accessible by each user.


1.4.3 Maps


A map is used to display EV charging stations and the current location of vehicles on the Web
UI. The map is implemented using the HERE Maps software package, which is developed by
HERE Technologies. In the Web UI, we have adapted HERE Maps to implement the specific
functionality required by UC1. The data shown in the map varies according to the user logged in,
and dynamically shows the vehicles tracking across the map. Section 1.4.4 details the adaptations
made to HERE Maps and illustrates the different display formats based on the user’s role defined
on the UC1 MOSAICrOWN platform. Figure 1.5 shows an example image of a map from HERE
Technologies [Tec].


Figure 1.5: A basic HERE Map


1.4.4 Data Access Types 


The data can be accessed through the Web UI using two different message formats: JSON or 
RDF (Resource Description Framework). The required data can be retrieved in RDF format using 
SPARQL queries. 


SPARQL Queries 


SPARQL is the standard query language for RDF triple stores. Structurally and syntactically
it looks similar to SQL, but the two differ semantically. SQL searches sets of
tables using a statically defined schema, whereas SPARQL searches a graph of nodes and edges
by matching a set of patterns. RDF triple stores and their associated query language are often
more flexible and expressive than conventional relational databases, as they are not bound by the
limitations of a static schema. As the name triple store implies, RDF is defined in terms of a binary
predicate, called a triple, consisting of a subject, an object, and a predicate that links them together.
RDF triples are normally ordered: subject, predicate, object. The WHERE clause of a query consists of
one or more triple patterns that the query must match; names prefixed with '?' are variables that
act as wildcards and can match any value. The results are returned as a list of triples that match
the query.

In the listing below, the SPARQL query returns any triples that have a model year equal to "2010".
Since the metadata knowledge base contains information about vehicles, the query will return all
vehicles made in 2010. All triples have a unique subject identifier, which in this case is the VIN
(Vehicle Identification Number) of the vehicles. So, in this case, the query returns all vehicle VINs
for vehicles made in 2010. This is a very simple query provided as an illustration, but far more
complex queries can be constructed, allowing the Web UI to extract very detailed and specific
information.


    SELECT ?subject ?predicate ?object
    WHERE { ?subject <https://uri.etsi.org/ngsi-ld/default-context/modelYear> "2010" . }
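

To make the pattern-matching idea concrete, the following Python sketch evaluates a single triple pattern against a small in-memory list of triples. It is a toy stand-in for a triple store, not how a SPARQL engine is implemented, and the `urn:vin:` identifiers and data values are invented for the example.

```python
MODEL_YEAR = "https://uri.etsi.org/ngsi-ld/default-context/modelYear"

# Tiny in-memory stand-in for the metadata triple store (illustrative values).
TRIPLES = [
    ("urn:vin:AAA111", MODEL_YEAR, "2010"),
    ("urn:vin:BBB222", MODEL_YEAR, "2015"),
    ("urn:vin:CCC333", MODEL_YEAR, "2010"),
]


def match(pattern, triples):
    """Yield variable bindings for one (subject, predicate, object) pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


results = [b["?subject"] for b in match(("?subject", MODEL_YEAR, "2010"), TRIPLES)]
print(results)
```

Only `?subject` is bound by this pattern, which mirrors the query above: the predicate and object are fixed, so the useful output is the list of matching subjects.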


Dashboards in Web UI 


Web UI has four separate views, one for each user role. Depending on the user logged in, the 
dashboard will display a different view and thus give a different perspective on the data. The 
following paragraphs present details about each user role’s dashboard. 


MOSAICrOWN Cloud Provider Dashboard. The MOSAICrOWN Cloud Provider view has
five tabs: Dashboard, History, Reports, SPARQL and Add New User. Figure 1.6 shows the
Dashboard view. The user can see the locations and details of the EV charging stations, and the vehicle
details in the dashboard. The user can also add new users in the Add New User tab, and submit
custom queries for data from the metadata knowledge base in the SPARQL tab.


Figure 1.6: Dashboard for MOSAICrOWN Cloud Provider


Figure 1.7: Dashboard for vehicle driver


Vehicle Driver Dashboard. The Vehicle Driver view has four tabs: Dashboard, Charging
Stations, Reports and Pricing. Figure 1.7 shows the Dashboard view. The vehicle driver can see the
details related to the vehicle, for example the battery level, the location of the nearest charging
station, the current location, and the speed.


Fleet Owner Dashboard. The Fleet Owner view has six tabs: Dashboard, Charging Stations,
Reports, Cars, SPARQL and Fleet Management. Figure 1.8 shows the Dashboard view. The Fleet
Owner can see vehicles and charging stations marked on the map, along with vehicle details in the
dashboard. In the Reports tab, analytics (average speed, average battery level) are displayed
if already calculated; otherwise, the average values can be calculated and the data re-ingested.
The Fleet Management tab is used to link a vehicle driver to the VIN. This linking is required so that
the vehicle driver can log in to see the details.


EV Infrastructure Provider Dashboard. The EV Infrastructure Provider view has five tabs:
Dashboard, Charging Stations, Reports, SPARQL and Pricing. Figure 1.9 shows the Dashboard
view. The EV Infrastructure Provider can view the status of the EV charging stations. Also, the
user can view analytics of the average frequency of an EV charging station being full/empty, which is
overlaid as a heatmap on the map.


Raw Data Access 


The data received from vehicles are stored as JSON files and in RDF triple stores. The Web UI
facilitates access to this raw data from storage for each user by adding the user credentials to the
query, thus implementing access control, authorization and restrictions. The data available depends
on the privacy of the data and the access rights of the user.


Figure 1.8: Dashboard for Fleet Owner


Figure 1.9: Dashboard for EV Infrastructure Provider


1.5 Policy Engine 


This section gives an overview of the Policy Engine developed by the academic partners, which
is detailed in Deliverables D3.3 “First Version of Policy Specification Language and Model” and
D3.4 “Final Tools for the Governance Framework”, and explains the extension of its functionality
that enables it to be integrated into the UC1 MOSAICrOWN platform.


1.5.1 Overview of Tools 


The Policy Engine is responsible for parsing the user-defined policies and checking whether an
access request for a given piece of subject data is permitted or denied.

For UC1, the Policy Engine was implemented as an access control mechanism, operating on
requests from the Web UI to the data market. A subject in this context refers to the vehicle and
related metadata. Access to data is validated through the policy linked to the vehicle's metadata.
The policy defines the permitted data points and the context in which they can be accessed,
including the intended purpose, the action to be performed, and the user making the request. All access
requests to data in the data market are governed by the Policy Engine. The modifications of the
different components of the Policy Engine are described next.
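
The permit/deny decision can be illustrated with a deliberately simplified Python sketch. It is not the Policy Engine's algorithm: the policy dictionary only borrows the field names of Listing 1.1, the `odrl:read` action label is assumed, and the rule that one permission must cover every requested target is a design choice made for this example.

```python
FLEET_MANAGER = "http://dellemc.com/user/fleetmanager"

# Cut-down policy in the shape of Listing 1.1 (two targets, one action).
POLICY = {
    "permission": [{
        "assignee": FLEET_MANAGER,
        "target": [
            "http://dellemc.com/icv/vin",
            "http://dellemc.com/icv/driverSeatLocation",
        ],
        "action": ["odrl:read"],
        "purpose": ["statistical", "marketing"],
    }]
}


def is_permitted(policy, user, targets, action, purpose):
    """Permit only if a single permission covers the whole request."""
    for perm in policy.get("permission", []):
        if (
            perm["assignee"] == user
            and action in perm["action"]
            and purpose in perm["purpose"]
            and set(targets) <= set(perm["target"])
        ):
            return True
    return False  # default deny when nothing matches


print(is_permitted(POLICY, FLEET_MANAGER,
                   ["http://dellemc.com/icv/vin"], "odrl:read", "statistical"))
```

Denying by default means that a request naming an unlisted data point, action, purpose or user is rejected without needing an explicit prohibition.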


1.5.2 Modification of Tools


Containerization


The Policy Engine was containerized using Docker and deployed and managed with the other core 
components of the platform. This modification enables smoother integration and management of 
the tool with other components, as well as a portable and reproducible deployment of all compo- 
nents of the platform. Deploying the Policy Engine in a containerized environment also allowed 
for greater control in the definition of communication mechanisms with other related components 
of the platform. 


API 


For UC1, communication between the Policy Engine and Web UI is made through HTTP requests.
The API was developed using Flask, which is a Python-based web framework. The primary role
of this API is to receive and parse incoming HTTP requests, passing the incoming request to the
Policy Engine Front-end and Core in a compatible format. Flask is used to wrap the Policy Engine
and expose it as an access control service within the UC1 MOSAICrOWN platform. The API
provides two endpoints: a subject query, where a specific ID (subject) is defined, and subjectless
queries, which only define predicates to query. Both of these endpoints implement a remote policy
loading function to retrieve the subject’s linked policy data from the RDF metadata triple store,
before passing the policy and query to the Policy Engine Core. If the Policy Engine Core permits
the subject access request, data (specified by the SPARQL query) is returned to the requester, in
this case the Web UI. Figure 1.10 illustrates the Policy Engine API data flow.


Figure 1.10: Policy Engine API data flow


{
  "query": "SELECT ?vin ?seatLoc WHERE {?vin <http://dellemc.com:8080/icv/driverSeatLocation> ?seatLoc .}",
  "type": "vehicle",
  "user": "fleetmanager",
  "action": "read",
  "purpose": "statistical"
}


Figure 1.11: Parameters of a subjectless API endpoint query 
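
The request parameters above can be checked before a query reaches the Policy Engine. The
following is a minimal sketch, assuming that all five fields of Figure 1.11 are mandatory; the
function name `parse_subjectless_request`, the error handling, and the action value "read" are
illustrative assumptions and are not taken from the UC1 code base. It uses only the Python
standard library and is independent of the web framework.

```python
import json

# Field names as they appear in Figure 1.11.
REQUIRED_FIELDS = ("query", "type", "user", "action", "purpose")


def parse_subjectless_request(body: str) -> dict:
    """Parse a JSON request body and check that every context field is present."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return {name: payload[name] for name in REQUIRED_FIELDS}


example_body = json.dumps({
    "query": "SELECT ?vin ?seatLoc WHERE {?vin "
             "<http://dellemc.com:8080/icv/driverSeatLocation> ?seatLoc .}",
    "type": "vehicle",
    "user": "fleetmanager",
    "action": "read",  # assumed value, for illustration only
    "purpose": "statistical",
})
request = parse_subjectless_request(example_body)
```

Rejecting incomplete requests early keeps the access decision itself limited to requests that
carry the full context (user, action, purpose) that a policy refers to.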


Remote Policy loading 


The Policy Engine was extended with functionality for loading a MOSAICrOWN policy at request
time. Using the Policy Engine's SPARQL query parsing, either one identifier (subject query) or
many identifiers (subjectless query) are extracted (Figure 1.11). Using these IDs, a request is
made to the RDF metadata triple store for the policy of each ID. These policies are parsed and
loaded into an in-memory RDF graph before being passed to the Policy Engine Core together with the
data request.
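
The loading step described above can be sketched as follows. This is a simplified illustration
under stated assumptions: policies are modelled as plain dictionaries instead of RDF graphs, the
triple store is replaced by an in-memory dictionary, and a request is permitted only if every
loaded policy allows it. The names `load_policies`, `is_permitted`, and `STORE` are hypothetical
and do not correspond to the actual Policy Engine interfaces.

```python
from typing import Callable, Dict, Iterable

# A policy is modelled as a plain dict of allowed values; the real engine uses RDF graphs.
Policy = Dict[str, set]


def load_policies(ids: Iterable[str], fetch_policy: Callable[[str], Policy]) -> Dict[str, Policy]:
    """Fetch one policy per identifier at request time."""
    return {subject_id: fetch_policy(subject_id) for subject_id in ids}


def is_permitted(request: dict, policies: Dict[str, Policy]) -> bool:
    """Permit only if every loaded policy allows the user, action, and purpose."""
    if not policies:
        return False  # assumption: deny when no policy could be loaded
    return all(
        request["user"] in policy["users"]
        and request["action"] in policy["actions"]
        and request["purpose"] in policy["purposes"]
        for policy in policies.values()
    )


# Hypothetical in-memory stand-in for the RDF metadata triple store.
STORE = {
    "vin-001": {"users": {"fleetmanager"}, "actions": {"read"}, "purposes": {"statistical"}},
    "vin-002": {"users": {"fleetmanager"}, "actions": {"read"},
                "purposes": {"statistical", "billing"}},
}

request = {"user": "fleetmanager", "action": "read", "purpose": "statistical"}
policies = load_policies(["vin-001", "vin-002"], STORE.__getitem__)
decision = is_permitted(request, policies)
```

Passing the fetch function as a parameter mirrors the separation between policy retrieval and
policy evaluation: the evaluation logic does not need to know where the policies are stored.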


1.6 Encrypted File System 


The encrypted file system provides UC1 with data wrapping capabilities. The data market filter
component can send data for wrapping if encryption of that data is desired. The encrypted file
system builds on two tools developed by MOSAICrOWN consortium partners: the FreyaFS encrypted
file system and the aesmix library, which implements the Mix&Slice all-or-nothing transform
encryption mode and is used by FreyaFS. Both were detailed in Deliverables D4.1 “First Version of
Encryption-based Protection Tools” and D4.3 “Final Encryption-based Techniques”. This section
details the work carried out to extend the functionality offered by these tools.

Figure 1.12: Architecture of initial FreyaFS utility


1.6.1 Overview of Tools 


The FreyaFS encrypted file system applies aesmix encryption seamlessly and transparently by
hooking directly into the operating system's file system operations as an intermediary. FreyaFS
uses a FUSE (Filesystem in Userspace) mount point to mount a local directory. Upon activation,
FreyaFS ties up the current terminal, as it runs as a service. Files created within, or moved
into, the mounted directory are encrypted using aesmix. This work addresses the requirements for
centralized key management infrastructure (REQ-UC1-AC3), compression and encryption
(REQ-UC1-DI6), and protection at rest and in transfer (REQ-UC1-DM5S).

The all-or-nothing-transform (AONT) Mix&Slice encryption mode creates a fully bit- 
interdependent representation of the source data via mixing. This encrypted representation is also 
sliced into fragments. Every fragment is necessary to successfully decrypt, and even one missing 
fragment prevents the decryption. 
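
The all-or-nothing property can be illustrated with a toy example. The sketch below is NOT the
Mix&Slice construction and does not use aesmix; it is a simplified package-style transform built
on SHA-256, written only to show why every fragment is needed. All function names are
hypothetical, and the code is not intended for production use.

```python
import hashlib
import secrets
from typing import List

BLOCK = 32  # bytes per fragment; equals the SHA-256 digest size


def _mask(key: bytes, index: int) -> bytes:
    """Derive a 32-byte mask from a key and a fragment index."""
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def package(message: bytes) -> List[bytes]:
    """Toy all-or-nothing transform: all returned fragments are needed to invert it."""
    if len(message) % BLOCK:
        raise ValueError("toy example: message length must be a multiple of BLOCK")
    inner_key = secrets.token_bytes(BLOCK)
    blocks = [message[i:i + BLOCK] for i in range(0, len(message), BLOCK)]
    fragments = [_xor(block, _mask(inner_key, i)) for i, block in enumerate(blocks)]
    # The last fragment hides the inner key under a digest of every other fragment.
    tail = inner_key
    for i, fragment in enumerate(fragments):
        tail = _xor(tail, _mask(fragment, i))
    return fragments + [tail]


def unpackage(fragments: List[bytes]) -> bytes:
    """Recover the inner key from all fragments, then unmask each block."""
    *body, tail = fragments
    inner_key = tail
    for i, fragment in enumerate(body):
        inner_key = _xor(inner_key, _mask(fragment, i))
    return b"".join(_xor(fragment, _mask(inner_key, i)) for i, fragment in enumerate(body))


message = b"A" * 64
fragments = package(message)
restored = unpackage(fragments)
```

In this toy version, replacing or withholding any single fragment yields a wrong inner key, so
none of the blocks can be recovered. Mix&Slice achieves a comparable interdependence through its
mixing of the encrypted representation.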


1.6.2 Modification of Tools 


This section covers the main technical outcomes of the development of the encrypted file system
tools. Three primary modifications have been carried out: the containerization of the FreyaFS
encrypted file system, the investigation of container state externalization, and the integration
of external key management. They are explained in the remainder of the section. These extensions
are interoperable and form a novel extension to FreyaFS to meet the requirements of UC1.


Containerization of FreyaFS 


Work was carried out on containerizing the FreyaFS utility. Instead of FreyaFS running as a
service on the host, the process now runs in a container, which can be flagged to run in the
background. This means that many instances of the utility can easily be run at the same time on
one machine, and that the service can be started and stopped more gracefully. The directory which
is mounted using the FreyaFS tool exists within a container volume. This approach circumvents the
issue of the containerized application itself losing state when the container restarts.

The containerization of FreyaFS changes the method by which files are accessed. In the original
version of FreyaFS, other UC1 components could read or write data directly in the FreyaFS mounted
directory, which was on the host's local file system. The method of file access for the
containerized FreyaFS tool utilizes Docker container storage interface (CSI) plugins to enable a
direct pipeline from UC1 components, such as the data market filter, to the encrypted file system.

Figure 1.13: Architecture of modified FreyaFS utility


Externalizing Container State 


Exploratory work to build upon the utilization of container volumes was carried out by leverag- 
ing an external object storage platform to facilitate stateful container migration by making state- 
essential application data (such as FreyaFS metadata files) remotely accessible to containers. This 
means that a container can migrate between hosts, access its externalized data from remote stor- 
age, and then resume operation from its most recent copy of the data. The exploratory work fed 
into the externalization of the entire FreyaFS mounted directory to remote storage, rather than 
using a local directory. The system saves the FreyaFS data within an S3 bucket. The S3 bucket is
exposed to a containerized application through a common storage interface driver, which mounts it
locally so that it appears as a standard container volume. EISI's open source REX-Ray driver is
the common storage interface plugin used to implement this functionality.


Integration of KMIP 


The FreyaFS utility was extended with an OASIS Key Management Interoperability Protocol (KMIP)
client using PyKMIP. This replaces the integrated password generator in FreyaFS with a remote key
management server that provides cryptographic keys and management functionality to FreyaFS and,
more specifically, to the aesmix cryptography library running in the background. Because FreyaFS
is written in Python, the PyKMIP extension could be included directly in the code.

The KMIP functionality currently operates using the PyKMIP client integrated into FreyaFS and a
PyKMIP test server, but any server compliant with the OASIS KMIP standard is compatible with the
system. The KMIP-enabled FreyaFS version is integrated into the containerized build, and the
container communicates with an external KMIP server to fetch keys. The primary motivation for
adding a KMIP client to FreyaFS and supporting external key management was to address the key
management requirements and to improve the general security of the tools in UC1.
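
The design change described above, replacing an integrated generator with an external key
source, can be sketched as a small interface. The example below is an illustration only: the
names `KeyProvider`, `LocalGenerator`, `RemoteKeyServerStub`, and `describe_key` are hypothetical
and are taken neither from FreyaFS nor from PyKMIP, and the "server" is an in-memory dictionary.
In a real deployment the stub would be replaced by a KMIP client.

```python
import secrets
from typing import Dict, Protocol


class KeyProvider(Protocol):
    """Anything that can return key material for a key identifier."""

    def get_key(self, key_id: str) -> bytes: ...


class LocalGenerator:
    """Stand-in for an integrated generator: keys are created and kept on the host."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def get_key(self, key_id: str) -> bytes:
        return self._keys.setdefault(key_id, secrets.token_bytes(32))


class RemoteKeyServerStub:
    """Stand-in for an external key management server (in memory, for illustration)."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store

    def get_key(self, key_id: str) -> bytes:
        try:
            return self._store[key_id]
        except KeyError:
            raise LookupError(f"unknown key id: {key_id}") from None


def describe_key(provider: KeyProvider, key_id: str) -> str:
    """The file system only depends on the provider interface, not on the key source."""
    key = provider.get_key(key_id)
    return f"{key_id}: {len(key) * 8}-bit key"


server = RemoteKeyServerStub({"file-1": secrets.token_bytes(32)})
summary = describe_key(server, "file-1")
```

Keeping the key source behind one interface is what allows the same file system code to work
with a test server during development and with any standard-compliant server in deployment.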


1.7 Summary


This chapter presented an overview of UC1 and the linked technological approaches. It demonstrated
how the results of MOSAICrOWN can be applied to the development of an automotive data market,
where confidential information can be traded in a safe and secure manner. In particular, it
showed the value of integrating the Policy Engine, developed by the academic partners, to
effectively control access to data indexed by a metadata knowledge graph. The main items of work
detailed in this chapter were the Automotive Tools, the Web UI, the wrapping of the Policy Engine,
and the Encrypted File System.

The use case implementation has provided us with practical experience of how to build a flexible
and extensible data market based on emerging technologies, such as knowledge bases, RDF-based
access policies, and dynamically encrypted file systems. This design will help form a blueprint
for future customer solutions that provide a way to safely and securely monetize data in an open
market.


2. Use Case 2 (MC): Tools for financial data markets


This chapter details the tool to satisfy Use Case 2 (UC2) focusing on transaction-level financial 
data and on the importance of data wrapping techniques for the final purpose of analytics and the 
extraction of business insights from the data itself. UC2 considers requirements in Deliverable 
D2.1 “Requirements from the Use Cases” and the tools developed are layered as below. 


Presentation. UC2 was designed as a web application where each user can access the platform,
upload their dataset, and generate analytics after data anonymization.


Application. The application layer manages the logic and the engine of the platform. It has three 
different components: 


• Identity layer: allows users to be authenticated (Identity Server).


• Wrapping layer: analyzes the dataset provided and stored in the data layer and, accordingly,
defines and applies the wrapping techniques for each field.


• Analytics layer: once the data is anonymized, users can access a dedicated dashboard to
generate insights based on analytics built on the anonymized data. The analysis provides
aggregated results, so no granular information is presented. The user must use a specific
data-upload template, related to the relevant industry and market.


Data. The users, once connected to the platform via a web browser, upload the dataset to be
anonymized. The platform proposes wrapping techniques for each field (according to the policy
regulation selected), which can be confirmed or overwritten by the user. The platform then
anonymizes the dataset.


This chapter covers the cloud-based platform that anonymizes Personally Identifiable Information 
(PII). The application is designed to answer different user needs detailed as two different user jour- 
neys. One journey is dedicated to the data anonymization and the other one focuses on generating 
analytics from the anonymized data, as detailed next. 


User journey 1 (Data Wrapping and Policy Language). The user selects a policy in alignment with
the type of data and/or the regulations to be applied. Then, the platform runs a semantic
analysis of the dataset uploaded by the user, recognizing its data types, semantics, and data
distribution, and proposes wrapping techniques accordingly. The user can override the proposed
techniques, which are predefined in the policy for each metadata type. Once the file is
anonymized, the user can download the fully anonymized dataset. During the upload phase, the
user can choose whether the uploaded dataset should also be made available in the Analytics
section or should only go through the anonymization phase. If the uploaded dataset respects a
predefined template, it is made available in the Analytics section once anonymized.


User journey 2 (Data Analysis). Once the anonymization in User journey 1 is completed, the user
can access the Analytics section, which uses the anonymized data. The user can view the summary
dashboard as well as the individual analytics pages, which give a segmented view by:


• product
• channel
• card type
• category


The evolution over time can be visualized:


• monthly
• quarterly
• yearly


The dataset needs to respect a predefined template that matches the types of analytics above
(i.e., product, channel, card type, and category) for the platform to be able to generate
analytics.


2.1 Background 


2.1.1 Overview 


Organizations must develop effective data strategies and use privacy-enhancing techniques to meet
regulatory requirements and consumer expectations while continuing to innovate. Data have the
potential to fuel innovation, but only if data practices are held to the high standards that
customers and partners deserve with respect to privacy regulations. The different privacy
regulations create challenges and expectations for innovation and differentiation, as well as an
opportunity to build trust with consumers. The regulations also require market players to build
privacy-by-design into standard business processes and to leverage best-practice data management
standards, such as data minimization, to enable continued innovation. The adoption of data
anonymization expands the amount of data that can be used for analytics: organizations can
include in the source dataset not only current customers, but also customers who left the
organization in previous years. Anonymization thereby increases the reliability of the insights
provided, particularly those related to risk and fraud.


Context in MOSAICrOWN 


The main focus of this chapter is within Work Package 2 (WP2) of MOSAICrOWN, and it describes the
solution developed in the context of UC2. The solution is based on the results and findings of
Work Packages 3-5, and the focus is on preserving the confidentiality of data for analytical
purposes, as per the UC2 scope of work. WP2 focuses on defining the requirements of the different
use cases, and this chapter captures the final versions produced.


UC2: Background on core technologies


UC2 considers several key features, such as the recognition of the different levels of Personally
Identifiable Information (PII) in financial data. UC2 also considers the application of adequate
wrapping techniques according to the recognized data type, and the possibility for the user to
adjust both the data type and the wrapping technique to be applied. Data governance, wrapping,
and sanitization are key components of UC2.


2.2 Solutions 


Functionality and goals 


UC2 is a cloud-based platform, where each user can upload a data file for anonymization and later
generate analytics from the anonymized dataset.

The user selects a policy, which is the interpretation of a privacy regulation. The policy
defines a set of wrapping techniques to apply in order to generate an anonymized dataset that
complies with the regulation. Then, the platform runs a semantic analysis of the data provided,
recognizing the data types, semantics, and data distribution of the dataset uploaded by the user,
and proposes the wrapping techniques to be applied accordingly. The techniques proposed depend on
the policy selected, the data type, and the data distribution in each field; nevertheless, the
user can still customize the wrapping techniques to be applied. The final output, the fully
anonymized dataset, can be downloaded by the user.

In addition to the application and customization of anonymization techniques, the data are also
processed to generate analytics.


Policies 


Figure 2.1 shows an example of a sample dataset with the policy file defined for it by the pilot.
A policy, which is a formalization of a privacy regulation in a technical language, provides a
schema combining data and anonymization techniques based on the privacy requirements defined in
the regulation. For each data type, the policy proposes an applicable wrapping technique. Once
the dataset analysis is completed, the user can select a wrapping technique different from the
one proposed by the policy to anonymize the related data field. The platform can manage different
privacy regulations; for each of them, a configuration file in JSON format is created and
uploaded into the system. This allows the platform to keep the stored policies aligned with the
evolution of the regulation over time.
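
The interplay between the policy proposal and the user's choice can be sketched as follows. The
JSON layout, the semantic type names, and the technique names below are hypothetical and do not
reproduce the configuration file of Figure 2.1; the sketch also assumes that the first technique
listed for a semantic type is the one proposed by default.

```python
import json

# Hypothetical policy file: semantic type -> allowed wrapping techniques (first one is proposed).
POLICY_JSON = """
{
  "policy": "GDPR",
  "techniques": {
    "name": ["suppression", "pseudonymization"],
    "card_number": ["tokenization", "masking"],
    "amount": ["generalization", "noise_addition"]
  }
}
"""


def propose(policy: dict, column_types: dict) -> dict:
    """Propose the default technique of the policy for each column."""
    return {column: policy["techniques"][semantic_type][0]
            for column, semantic_type in column_types.items()}


def apply_overrides(policy: dict, column_types: dict, proposal: dict, overrides: dict) -> dict:
    """Replace proposed techniques with user choices that the policy allows."""
    final = dict(proposal)
    for column, technique in overrides.items():
        allowed = policy["techniques"][column_types[column]]
        if technique not in allowed:
            raise ValueError(f"{technique!r} is not allowed for column {column!r}")
        final[column] = technique
    return final


policy = json.loads(POLICY_JSON)
column_types = {"customer": "name", "pan": "card_number", "value": "amount"}
proposal = propose(policy, column_types)
final = apply_overrides(policy, column_types, proposal, {"pan": "masking"})
```

Because each regulation has its own configuration file, supporting an updated regulation in such
a design amounts to uploading a new file rather than changing the anonymization code.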


2.3 Architecture and Components 


2.3.1 Client-Server Model 


The solution architecture is based on three tiers: presentation, application, and data, as shown
in Figure 2.2.

The presentation tier contains the web application layer, which users use to access the platform,
upload the datasets to be anonymized, and browse the Analytics section. This tier communicates
with the application tier and allows users to identify themselves and be authenticated (Identity
layer), upload datasets, choose anonymization techniques, receive datasets in anonymized format,
and see the aggregated values (Analytics KPIs) and insights available in the Analytics section
(API layer).


Figure 2.1: Configuration file example


Figure 2.2: Three tier architecture


The application tier contains:


• Identity layer


• Web server/API layer


• Data semantic/data wrapping Engine (DSDW Engine)


• Insights/KPIs Engine


The Identity layer and the Insights/KPIs Engine have been developed in C#, while the Web
server/API layer and the DSDW Engine have been developed using Python and Flask (a web framework
for Python) and packaged in a Docker container. The Identity layer authenticates users and
provides them with a token with a validity time, enabling them to use all the services offered by
the platform until the token expires. The Web server/API layer serves the contents to be shown on
the web interface, and communicates with the DSDW Engine to perform the anonymization processes
and with the Insights/KPIs Engine to feed the Analytics section. The DSDW Engine has two tasks:
i) semantically recognize the content of the dataset, and ii) perform the anonymization processes
requested by the user.
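
The idea of a token with a validity time can be illustrated with a short sketch. It is written in
Python for consistency with the other examples, whereas the Identity layer itself is developed in
C#; the token layout, the secret, and the function names are assumptions made for illustration
and do not describe the actual Identity Server.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def issue_token(user: str, ttl_seconds: int, now: float) -> str:
    """Create a signed token of the form user|expiry|signature."""
    expires = int(now) + ttl_seconds
    payload = f"{user}|{expires}"
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def validate_token(token: str, now: float) -> bool:
    """Accept the token only if the signature matches and it has not expired."""
    try:
        user, expires, signature = token.rsplit("|", 2)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{user}|{expires}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    return expires.isdigit() and now < int(expires)


token = issue_token("alice", ttl_seconds=60, now=1000.0)
valid_now = validate_token(token, now=1030.0)
valid_later = validate_token(token, now=1060.0)
```

The current time is passed in explicitly so that the expiry behavior can be checked without
waiting; a service would use the system clock instead.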

The data tier holds all the data structures required for the correct behavior of the platform.
In detail:


• Job database contains the processes (jobs) that perform data semantic detection and the
application of data wrapping techniques. These jobs have a duration that varies depending on the
number of rows and columns in the dataset.


• Data Wrapping techniques database contains the set of wrapping techniques associated with the
various semantic types, divided by privacy policy (e.g., GDPR, Brazilian privacy regulation
LGPD).


• User database contains the set of authorized users able to access the platform.


• Analytics database contains all the data to be used inside the Analytics section.


• Temp files bucket is a bucket containing all the temporary files produced by the execution of
the data semantic detection and data wrapping jobs. These temporary files are deleted as soon as
they are no longer needed for the subsequent steps of the platform.


All databases inside the data tier are relational, with a predefined schema.


2.3.2 User Interface 


The user interface has been developed in Angular, a platform and framework for building
single-page client applications using HTML and TypeScript. Through a dedicated workflow, it
guides users through the various steps required to receive a dataset containing data anonymized
according to the selected wrapping techniques. The UI interacts with a backend layer through
authenticated REST API calls (the authentication token provided by the Identity server must be
included in every call made after login). Once logged into the platform, the user is asked for
the policy to be applied. The user is also asked to select the user journey, which is either
dataset anonymization only or dataset anonymization followed by the generation of analytics, as
shown in Figure 2.3. After the policy and the user journey have been selected, the platform asks
the user to upload the dataset (the expected file types are xls, xlsx, and csv). If the user
selects data anonymization and analytics generation, the dataset format needs to match the
analytics structure. If the uploaded dataset does not match this format, the user receives an
error message, as shown in Figure 2.4, and the option to download the correct template. Once the
analysis of the uploaded dataset is finished, a sample of the data is shown together with the
semantic type that the DSDW Engine has predicted for each column. As shown in Figure 2.5, the
user can change the semantic type of each column by choosing one contained in the previously
selected policy. Figure 2.6 shows how the platform presents, for each column, the set of data
wrapping techniques that can be executed on that particular semantic type. Once the user has
chosen the data wrapping techniques to be applied, the platform allows the download of the
anonymized dataset, as shown in Figure 2.7.


[Screenshot: dataset upload page with the user journey options "Anonymization" and
"Anonymization + Analytics"]


Figure 2.3: Select privacy policy and user journey 


[Screenshot: error message "Uploaded dataset does not respect the template required for the
Analytics section. Proceed to the anonymization step."]


Figure 2.4: Error message and data upload structure template 


Analytics Section 


The user can go directly to the Analytics section using the link in the top right corner of the
platform, next to the logout button. The Summary tab, shown in Figure 2.8, presents aggregated
KPIs for transactions and average spend per ticket, MoM/QoQ/YoY values based on the last
month/quarter/year of data in the database, and a section containing several pie charts showing
metrics divided by channel, by product, and by industry.


Figure 2.5: Data semantic type detection


Figure 2.6: Data wrapping techniques


Figure 2.7: Download anonymized dataset


[Screenshot: Analytics section, Summary tab, with the tabs Summary, Overall, By Product, By
Channel, By Card Type, and By Category; KPI tiles for Transactions and Average Spend per Ticket;
pie charts by Channel, Product, and Industry]


Figure 2.8: Analytics section - Summary tab 


The Overall section, shown in Figures 2.9 and 2.10, presents the set of all aggregated data with
the possibility of: i) selecting the reference time interval (monthly, quarterly, yearly), and
ii) selecting the index on which the underlying insights are calculated (spend, transactions,
average ticket size). The graph in Figure 2.10 shows the temporal trend of the selected index.
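
The two selections described above, time interval and index, can be sketched as a simple
aggregation. The record layout (a date and an amount per transaction) and the function names are
assumptions made for illustration; they do not reflect the schema of the analytics database or
the implementation of the Insights/KPIs Engine.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def period_key(day: date, interval: str) -> str:
    """Map a date to its month, quarter, or year label."""
    if interval == "monthly":
        return f"{day.year}-{day.month:02d}"
    if interval == "quarterly":
        return f"{day.year}-Q{(day.month - 1) // 3 + 1}"
    if interval == "yearly":
        return str(day.year)
    raise ValueError(f"unknown interval: {interval}")


def aggregate(transactions: Iterable[Tuple[date, float]], interval: str, index: str) -> Dict[str, float]:
    """Aggregate transactions per period for the selected index."""
    spend: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    for day, amount in transactions:
        key = period_key(day, interval)
        spend[key] += amount
        count[key] += 1
    if index == "spend":
        return dict(spend)
    if index == "transactions":
        return dict(count)
    if index == "average ticket size":
        return {key: spend[key] / count[key] for key in spend}
    raise ValueError(f"unknown index: {index}")


sample = [(date(2021, 1, 5), 10.0), (date(2021, 1, 20), 30.0), (date(2021, 4, 2), 20.0)]
monthly_spend = aggregate(sample, "monthly", "spend")
quarterly_count = aggregate(sample, "quarterly", "transactions")
yearly_average = aggregate(sample, "yearly", "average ticket size")
```

Only per-period aggregates leave the function, which matches the principle that the Analytics
section presents no granular information.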


[Screenshot: Analytics section, Overall tab, with selectors for the index (e.g., SPEND) and the
geography (e.g., CROSSBORDER), and the average over the selected months]


Figure 2.9: Analytics section - Overall view (1/2) 


The By Product tab, shown in Figure 2.11, and all the following tabs show different KPIs,
depending on the tab, with the same possibility of changing the time interval and the reference
index.


Figure 2.10: Analytics section - Overall view (2/2)


Figure 2.11: Analytics section - By product


2.3.3 Application Review 


The proof-of-concept consists of three applications, one for each functionality of the platform: 


• The core of the solution, which is based on the DSDW Engine and its functionalities. Once the
dataset is uploaded by the user, it is saved in the bucket as a temporary file named with a
unique alphanumeric identifier. A new job associated with that temporary file is created in the
job table. The DSDW Engine takes over the job and performs a pre-processing phase on the received
dataset based on tokenization and embedding extraction (required for the next phases).


After this phase, the DSDW Engine computes, for each column of the dataset, the probability that
the column belongs to a particular semantic type according to its content. The semantic type with
the highest probability is selected by default, but the user can modify it if needed. Once the
semantic type of each column has been identified, the job status is updated to allow the
interface to show the result of the analysis. Once the user has selected the data wrapping
technique to be applied to each column, the DSDW Engine applies the techniques to the original
dataset, creating a new temporary file that contains the data in anonymized form. This file is
then made available to the user for download. If, during the upload phase, the user requested
that the dataset be included in the Analytics section, the anonymized dataset is also uploaded to
the analytics database.


• The Analytics section. By clicking on the Analytics link, the user can browse the Analytics
section containing the aggregated KPIs of all the datasets uploaded up to that moment, divided by
sections of interest (By Product, By Channel, By Card type, By Category).
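
The default selection of the semantic type can be sketched as follows. The probabilities below
are invented for illustration, and the function name `select_types` is hypothetical; how the
DSDW Engine computes the probabilities from tokens and embeddings is not shown.

```python
from typing import Dict, Optional


def select_types(probabilities: Dict[str, Dict[str, float]],
                 overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Pick the most probable semantic type per column unless the user overrides it."""
    overrides = overrides or {}
    selected = {}
    for column, scores in probabilities.items():
        default = max(scores, key=scores.get)  # type with the highest probability
        selected[column] = overrides.get(column, default)
    return selected


# Invented example scores for two columns of an uploaded dataset.
probabilities = {
    "col_a": {"name": 0.7, "city": 0.3},
    "col_b": {"card_number": 0.55, "phone": 0.45},
}
defaults = select_types(probabilities)
adjusted = select_types(probabilities, overrides={"col_b": "phone"})
```

Keeping the user override separate from the prediction means that the predicted probabilities
remain available, e.g., to show the user how confident the default choice was.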


2.3.4 Process Diagram 


Figure 2.12 describes the overall process of data analysis and wrapping, starting from an
unstructured dataset provided by the user. The process is fully asynchronous, which is mandatory
considering the amount of data that could be involved in this data transformation. The user
starts the process and retrieves the related job to monitor its execution. The same approach is
used for the application of the selected data wrapping techniques, including the optional upload
of the anonymized dataset to the analytics database if the dataset has the format required by the
platform.
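
The asynchronous pattern described above, submitting a job and then monitoring it by identifier,
can be sketched with a thread pool. The class `JobTable` and its methods are hypothetical and use
an in-memory dictionary instead of the job database; the sketch only illustrates the control
flow.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable, Dict, Optional


class JobTable:
    """In-memory stand-in for a job table: submit work, then poll by job identifier."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, task: Callable, *args) -> str:
        """Start the task in the background and return its job identifier immediately."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(task, *args)
        return job_id

    def status(self, job_id: str) -> dict:
        """Report whether the job is running, failed, or done (with its result)."""
        future = self._jobs[job_id]
        if not future.done():
            return {"status": "running"}
        error = future.exception()
        if error is not None:
            return {"status": "failed", "error": str(error)}
        return {"status": "done", "result": future.result()}

    def wait_for(self, job_id: str, timeout: Optional[float] = None) -> None:
        """Block until the job finishes or the timeout elapses."""
        wait([self._jobs[job_id]], timeout=timeout)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def count_rows(dataset: list) -> int:
    """Placeholder for a long-running analysis of the uploaded dataset."""
    return len(dataset)


table = JobTable()
job_id = table.submit(count_rows, [["a", 1], ["b", 2], ["c", 3]])
table.wait_for(job_id, timeout=5)
result = table.status(job_id)
table.close()
```

A client would typically call `status` repeatedly instead of `wait_for`; the blocking call is
used here only to make the example deterministic.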


2.3.5 Sequence Process 


Figure 2.13 shows the sequences defined in the pilot, and the components involved, for the
anonymization process. Figure 2.14 shows the sequences, and the components involved in the
different layers of the platform, for the analytics process. For the Analytics section of the
platform, the API requires the following parameters: Section, StartDate, and EndDate. Section
refers to the type of analytics, i.e., summary, overall, by card type, by category, by channel,
or by product, shown as tabs in, e.g., Figure 2.8. StartDate and EndDate define the requested
time frame for the analytics result. The API layer provides the data to populate the requested
fields in terms of dimensions, KPIs, and graphs.
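
The three parameters of the analytics API can be validated as sketched below. The parameter
names Section, StartDate, and EndDate are taken from the text; the ISO date format, the
lowercase section labels, and the function name are assumptions made for illustration.

```python
from datetime import date
from typing import Dict, Tuple

# Section labels follow the tabs of the Analytics section; the exact spelling is assumed.
SECTIONS = {"summary", "overall", "by card type", "by category", "by channel", "by product"}


def parse_analytics_request(params: Dict[str, str]) -> Tuple[str, date, date]:
    """Validate Section, StartDate, and EndDate and return them in typed form."""
    section = params["Section"].strip().lower()
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {params['Section']}")
    start = date.fromisoformat(params["StartDate"])  # assumed format: YYYY-MM-DD
    end = date.fromisoformat(params["EndDate"])
    if start > end:
        raise ValueError("StartDate must not be after EndDate")
    return section, start, end


parsed = parse_analytics_request(
    {"Section": "By Product", "StartDate": "2021-01-01", "EndDate": "2021-03-31"}
)
```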


2.4 Summary 


This chapter outlined the final prototype for UC2: data sharing and analysis through data
anonymization techniques and compliance, and the use of the anonymized data to generate
analytics. Specifically, we illustrated how UC2 is realized via a cloud-based storage solution
compatible with any public cloud provider. Demo access is provided via a central web front-end.
We see the presented prototype as an important milestone towards significantly more secure
information sharing between multiple parties, with the guarantee that PII is protected. The UC2
prototype has been designed for the Financial Services Industry (FSI), and particularly for
digital payment analysis. Nevertheless, the platform is industry-independent and can manage any
type of data to generate insights.


Figure 2.12: Process diagram


[Sequence diagram between the MOSAICrOWN Backend, the Data Semantic Model, and the Database
Server. Calls shown: GET /Login (validate user login; returns authentication token and user
info); GET /PrivacyPolicies (query the Privacy Policies table; returns privacy policies data);
POST /UploadDataset (create new job; returns job identifier; run model on uploaded dataset; save
results in the Jobs table); GET /JobStatus?JobId=XXX (query the Jobs table with the job
identifier; returns job result); POST /DataWrappi… (column types and DW techniques mapping; query
the Privacy Policies table for the data wrapping technique associations)]


Figure 2.13: Sequence process - Anonymization


Figure 2.14: Sequence process - Analytics


3.1 Introduction


This chapter covers Use Case 3 (UC3), a cloud-based data market for privacy-preserving analytics.
UC3 considers two parties in a business-to-business scenario that want to compute
privacy-preserving statistics over their joint data. Data sanitization that still permits
meaningful insights is the main technological issue of this use case. Differential privacy, a
privacy notion satisfied by randomized algorithms that, e.g., perturb an analytical output, is
the main technology applied in this use case. It can be augmented with the wrapping techniques
investigated in WP4, which provide complementary protection, e.g., encryption at rest.
Sanitization can be applied at different points throughout the data life-cycle, covering all
phases: ingestion, storage, and analytics. Figure 3.1 shows these phases and where sanitization
can be applied.


[Diagram: the three data life-cycle phases Ingestion, Storage, and Analytics]

Figure 3.1: UC3 Overview 


In option 1, the data is anonymized directly during ingestion, i.e., only anonymized data is
stored. In option 2, the data is stored in plaintext (or encrypted) and only anonymized on demand
(e.g., with regard to a chosen mechanism or analytical function). In option 3, the data is
analyzed directly between the parties, and they only learn the anonymized analytical result over
their joint data. The tools for Use Case 3 cover the options mentioned above: the anonymization
tool DPtool, detailed in Section 3.3.1, covers options 1 and 2. The privacy quantification and
anonymization tool MIA, described in Section 3.3.2, also covers options 1 and 2. The secure
computation tool DPsc, detailed in Section 3.3.3, covers option 3.

The remainder of this chapter is organized as follows. First, we provide preliminaries and
technological background for differential privacy as well as secure computation in Section 3.2.
Then, we present the tools that satisfy the requirements of UC3, namely DPtool, MIA, and DPsc, in
Section 3.3.


3.2 Background


In this section, the required background for the tools of this use case is briefly presented.
First, Section 3.2.1 describes anonymization mechanisms for differential privacy. Then,
Section 3.2.2 details cryptographic tools for secure computation.


3.2.1 Background on Differential Privacy 


In the following, we recall some preliminaries and definitions for differential privacy. Further
details can be found in Deliverables D5.1 “First Version of Data Sanitisation Tools”, D5.2 “First
Report on Privacy Metrics and Data Sanitisation”, and D5.3 “Final Report on Privacy Metrics,
Risks, and Utility”. We model a data set $D = \{d_0, \ldots, d_{n-1}\}$ as $n$ elements from a
data domain $U$. A neighboring data set $D'$ can be created from $D$ by removing or adding an
element.


Differential Privacy 


Definition 1 (Differential Privacy). A mechanism $M$ satisfies $(\epsilon, \delta)$-differential
privacy, where $\epsilon, \delta \geq 0$, if for all neighboring data sets $D$ and $D'$, and all
subsets $S$ of $\mathrm{Range}(M)$

$$\Pr[M(D) \in S] \leq \exp(\epsilon) \cdot \Pr[M(D') \in S] + \delta,$$

where $\mathrm{Range}(M)$ denotes the set of all possible outputs of mechanism $M$.


Note that the definition above does not place any bounds on the adversary. However, due to our
use of secure computation, we also require a definition with computationally bounded adversaries.
Mironov et al. define computationally indistinguishable differential privacy (IND-CDP) for
two-party computation (2PC) with computationally bounded parties. He et al. adapt the definition
of Mironov et al. for parties $A, B$ with data sets $D_A, D_B$, privacy parameters
$\epsilon_A, \epsilon_B$, and security parameter $\lambda$. Furthermore, $\mathrm{VIEW}_A^{\Pi}$
denotes the view of $A$ during the execution of protocol $\Pi$.


Definition 2 (IND-CDP-2PC). A two-party protocol II for computing function f satis- 
fies (€4(A), €g(A))-indistinguishable computationally differential privacy (IND-CDP-2PC) if 
VIEW (D,4,-) satisfies €g(A)-IND-CPA, i.e., for any probabilistic polynomial-time (in À) adver- 
sary A, for any neighboring data sets (Dg, Dr) 


Pr|A( VIEW) (D4,Dp)) = 1] 
<exp(€g)- Pr[A( VIEW (Da, Dh)) = 1] +negl(A). 


Likewise for B’s view for any neighbors (Da, D',) and £4. 


For convenience, we use € = €4 = Ep. 


Mechanisms 


Randomized algorithms, called mechanisms, are required to satisfy Definition 1.
The Laplace mechanism [DR14] works by adding noise sampled from the Laplace distribution to
a function evaluation.

Definition 3 (Laplace Mechanism). The Laplace mechanism for function $f : U^n \to \mathbb{R}$,
which has $\ell_1$-sensitivity $\Delta f = \max_{D \simeq D'} |f(D) - f(D')|$, releases

$$f(D) + \mathrm{Laplace}(\Delta f / \epsilon),$$

where $\mathrm{Laplace}(b)$ denotes a random variable from the Laplace distribution with scale $b$
and density $\mathrm{Laplace}(x; b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$.
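
As a minimal sketch of Definition 3, the following Python snippet draws Laplace noise as the difference of two exponential variables and adds it to a query result. The function names, the example data, and the parameter values are assumptions made for illustration; they are not taken from DPtool.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(scale) as the difference of two exponential variables."""
    # expovariate expects the rate (1 / mean), so both draws have mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value + Laplace(sensitivity / epsilon) as in Definition 3."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


# Illustrative counting query: adding or removing one element changes the
# count by at most 1, so the l1-sensitivity is 1.
ages = [23, 35, 41, 29, 52, 38]
true_count = sum(1 for age in ages if age >= 30)
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

A smaller $\epsilon$ yields a larger noise scale $\Delta f / \epsilon$ and therefore a noisier, more private release.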


The Laplace mechanism satisfies $(\epsilon, 0)$-DP, also called pure differential privacy.
Similarly, the geometric mechanism adds noise from the symmetric¹ geometric distribution. The
symmetric geometric distribution is a discrete version of the Laplace distribution, i.e., it
operates over integers, and is especially suited for anonymizing integer counts.


Definition 4 (Geometric Mechanism). The geometric mechanism for function $f : U^n \to \mathbb{R}$
with $\ell_1$-sensitivity 1 releases

$$f(D) + \mathrm{Geometric}(\exp(-\epsilon)),$$

where $\mathrm{Geometric}(b)$ denotes a random variable from the symmetric geometric distribution
with scale $b$ and density $\mathrm{Geometric}(x; b) = \frac{1-b}{1+b} \, b^{|x|}$.
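
The geometric mechanism of Definition 4 can be sketched as follows. The snippet relies on the fact that the difference of two independent one-sided geometric variables follows the symmetric geometric distribution; the function names are hypothetical and the code is an illustration, not the DPtool implementation.

```python
import math
import random


def one_sided_geometric(alpha, rng=random):
    """Sample k in {0, 1, 2, ...} with Pr[k] = (1 - alpha) * alpha**k."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(math.floor(math.log(u) / math.log(alpha)))


def symmetric_geometric(alpha, rng=random):
    """Difference of two one-sided draws: Pr[x] = (1-alpha)/(1+alpha) * alpha**|x|."""
    return one_sided_geometric(alpha, rng) - one_sided_geometric(alpha, rng)


def geometric_mechanism(true_count, epsilon, rng=random):
    """Release an integer count plus symmetric geometric noise (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + symmetric_geometric(math.exp(-epsilon), rng)


noisy_count = geometric_mechanism(42, epsilon=1.0)
```

Because the noise is integer-valued, the released count stays an integer, which is why this mechanism suits counting queries.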


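The geometric mechanism satisfies $(\epsilon, 0)$-DP. Another mechanism, called the Gauss mechanism
[DR14], uses noise from the Gauss distribution.


Definition 5 (Gauss Mechanism). The Gauss mechanism for function $f : U^n \to \mathbb{R}$, which
has $\ell_2$-sensitivity $\Delta_2 f = \max_{D \simeq D'} \|f(D) - f(D')\|_2$, releases

$$f(D) + \mathcal{N}(0, \sigma^2),$$

where $\sigma \geq c \Delta_2 f / \epsilon$ with $c^2 > 2 \log(1.25/\delta)$, and
$\mathcal{N}(0, \sigma^2)$ denotes a random variable from the Gauss distribution with variance
$\sigma^2$ and density $\mathcal{N}(x; 0, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2\right)$.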
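
The calibration of $\sigma$ in Definition 5 can be illustrated with a short Python sketch. It assumes the natural logarithm and restricts $\epsilon$ to $(0, 1)$, following the analysis in [DR14]; both choices, as well as the function names, are assumptions of this example.

```python
import math
import random


def gauss_sigma(l2_sensitivity, epsilon, delta):
    """Smallest sigma (up to a tiny margin) allowed by Definition 5."""
    if not 0 < epsilon < 1:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    # Definition 5 requires c^2 > 2 log(1.25 / delta); the factor enforces strictness.
    c = math.sqrt(2.0 * math.log(1.25 / delta)) * (1.0 + 1e-12)
    return c * l2_sensitivity / epsilon


def gauss_mechanism(true_value, l2_sensitivity, epsilon, delta, rng=random):
    """Release true_value + N(0, sigma^2)."""
    sigma = gauss_sigma(l2_sensitivity, epsilon, delta)
    return true_value + rng.gauss(0.0, sigma)


noisy_value = gauss_mechanism(100.0, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)
```

The required noise grows only with the square root of $\log(1/\delta)$, so even very small values of $\delta$ increase $\sigma$ moderately.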


The Gauss mechanism satisfies $(\epsilon, \delta)$-DP, also called approximate differential
privacy. The exponential mechanism [MT07] computes utility scores for a fixed set $O$ of possible
outputs and probabilistically selects an element based on its score. The formal definition is
according to [LLSY16].


Definition 6 (Exponential Mechanism). For any utility function $u : (U^n \times O) \to \mathbb{R}$
and a privacy parameter $\epsilon$, the exponential mechanism $M_u^\epsilon(D)$ outputs $o \in O$
with probability proportional to $\exp\left(\frac{\epsilon u(D, o)}{2 \Delta u}\right)$, where

$$\Delta u = \max_{\forall o \in O,\, D \simeq D'} |u(D, o) - u(D', o)|$$

is the sensitivity of the utility function. That is,

$$\Pr[M_u^\epsilon(D) = o] = \frac{\exp\left(\frac{\epsilon u(D, o)}{2 \Delta u}\right)}{\sum_{o' \in O} \exp\left(\frac{\epsilon u(D, o')}{2 \Delta u}\right)}.$$

The exponential mechanism satisfies $(\epsilon, 0)$-DP. While the mechanisms based on directly
adding noise are simpler than the exponential mechanism, the former cannot be applied to arbitrary
outputs (e.g., strings), whereas the latter supports arbitrary output domains. Furthermore, the
exponential mechanism provides better accuracy than the Laplace mechanism for certain tasks,
including the median [BK20b][MG20].
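
To make Definition 6 concrete, the sketch below computes the selection probabilities of the exponential mechanism for a small data set. The rank-based utility function is an assumption chosen for illustration (adding or removing one element changes it by at most 1, so $\Delta u = 1$); it is not necessarily the scoring used by DPtool or DPsc.

```python
import math


def selection_probabilities(data, outputs, utility, sensitivity, epsilon):
    """Return Pr[M(D) = o] for every o in outputs, following Definition 6."""
    scores = [epsilon * utility(data, o) / (2.0 * sensitivity) for o in outputs]
    # Subtracting the maximum score leaves the normalized result unchanged
    # but avoids overflow in exp().
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def median_utility(data, candidate):
    """Illustrative utility: 0 for a perfect median, more negative further away."""
    smaller = sum(1 for d in data if d < candidate)
    larger = sum(1 for d in data if d > candidate)
    return -abs(smaller - larger)


data = [1, 2, 3, 4, 5]
outputs = list(range(0, 7))  # assumed output domain
probs = selection_probabilities(data, outputs, median_utility, sensitivity=1.0, epsilon=1.0)
```

Every candidate in the output domain receives a non-zero probability, with the largest probability assigned to the true median.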


¹ Also called two-sided or double.


Inverse Transform Sampling for Exponential Mechanism


The exponential mechanism provides selection probabilities for each output element, i.e., it
induces a probability distribution over the output domain $O$. To sample from this distribution, we
use inverse transform sampling (ITS). ITS uses the uniform distribution to simulate any other
distribution. First, one samples a random element $r \in (0, 1]$ uniformly. Then, one finds the
output element $o_j \in O$ such that
$\sum_{i=1}^{j-1} \Pr[M_u^\epsilon(D) = o_i] < r \leq \sum_{i=1}^{j} \Pr[M_u^\epsilon(D) = o_i]$.
Finally, one outputs $o_j$. To illustrate the intuition behind ITS, assume $O = \{a, b\}$ where the
selection probability for $a$ is 10% and for $b$ it is 90%. Now, say we fill an array $L$ of size
100 with 10 elements $a$ and 90 elements $b$, select an array index $i$ uniformly at random, and
output $L[i]$. Then, $L[i]$ is $a$ with 10% probability and $b$ with 90% probability.
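
The ITS procedure described above can be sketched in a few lines of Python. The function name and the example values are illustrative assumptions; the example reuses the two-element output domain with probabilities 10% and 90%.

```python
import random
from itertools import accumulate


def inverse_transform_sample(outputs, probabilities, rng=random):
    """Return the first output whose cumulative probability reaches r."""
    r = 1.0 - rng.random()  # uniform in (0, 1]
    for output, cumulative in zip(outputs, accumulate(probabilities)):
        if r <= cumulative:
            return output
    # Only reached if floating-point rounding leaves the final sum below r.
    return outputs[-1]


sample = inverse_transform_sample(["a", "b"], [0.1, 0.9])
```

Drawing many samples yields "a" in roughly 10% of the cases, which mirrors the array intuition from the text.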


Geo Indistinguishability 


Geo indistinguishability [ABCP13] is a generalization of DP developed for location privacy, i.e.,
it allows randomizing coordinates. It generalizes DP to arbitrary metrics between inputs.


Definition 7 (Geo Indistinguishability). A mechanism $M$ satisfies $\epsilon$-GI iff for all points
$x, x'$:

$$d_P(M(x), M(x')) \leq \epsilon \, d(x, x'),$$

where $d$ is the Euclidean distance and $d_P$ the distance between the distributions over $M$ on
inputs $x, x'$, respectively.


A mechanism $M$ satisfying GI for a two-dimensional point $y = (y_1, y_2)$ reports a noisy version
of $y$ instead, based on the planar (i.e., two-dimensional) Laplace distribution. Informally, $M$
draws a uniformly random angle $\alpha$ in $(0, 2\pi)$, and draws a radius $r$ by sampling from the
Gamma distribution (which generalizes the Laplace distribution) with shape 2 and scale
$1/\epsilon$. Then, it outputs $(y_1 + r \sin(\alpha), y_2 + r \cos(\alpha))$.
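
The informal description of the planar Laplace mechanism translates into the following sketch. It treats the coordinates as points in a plane; mapping latitude/longitude values to such planar coordinates is outside the scope of this example, and the function name is hypothetical.

```python
import math
import random


def planar_laplace(y1, y2, epsilon, rng=random):
    """Perturb the planar point (y1, y2) with planar Laplace noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # gammavariate(alpha, beta) takes the shape alpha and the scale beta.
    radius = rng.gammavariate(2.0, 1.0 / epsilon)
    return (y1 + radius * math.sin(angle), y2 + radius * math.cos(angle))


noisy_point = planar_laplace(10.0, 20.0, epsilon=0.5)
```

The expected displacement is $2/\epsilon$ (the mean of the Gamma distribution with shape 2 and scale $1/\epsilon$), so $\epsilon$ has to be chosen relative to the unit of the coordinates.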


3.2.2 Background on Secure Computation 


Secure two-party computation allows two parties $A$ and $B$ with inputs $x_A$ and $x_B$,
respectively, to compute a function $y = f(x_A, x_B)$, such that the parties only learn $y$ and
nothing more. In other words, the parties jointly compute $f$ but never learn each other’s inputs.

We focus on semi-honest adversaries, also called passive adversaries, which do not deviate from 
a secure computation protocol, but try to learn as much as possible from the protocol execution. 
Malicious adversaries, also called active adversaries, on the other hand, try to alter the protocol 
execution and might provide malicious inputs. 

The two main implementation paradigms for secure computation are secret sharing
and garbled circuits [Yao86]. The former considers arithmetic circuits (arithmetic operations,
e.g., addition over integers), the latter Boolean circuits (logical operations, e.g., AND over bits).


Additive Secret Sharing 


For additive secret sharing, we assume all values to be in $\mathbb{Z}_{2^\ell}$ and that all
operations are performed modulo $2^\ell$. We consider two parties $A$ and $B$, where party $A$
holds a secret $s \in \mathbb{Z}_{2^\ell}$. The parties want to split the secret into two parts
$s_A, s_B$, where $A$ holds $s_A$ and $B$ holds $s_B$, such that both parts are required to
reconstruct the secret.
Such splitting can be achieved by first drawing a uniformly random value
$u \in \mathbb{Z}_{2^\ell}$, then setting $s_A = s - u$ and $s_B = u$ and distributing them to the
corresponding parties. We write $\langle s \rangle = (s_A, s_B)$ to refer to the secret sharing of
$s$.

Addition with sharings $\langle s \rangle, \langle r \rangle$ is straightforward as
$\langle s + r \rangle = \langle s \rangle + \langle r \rangle = (s_A + r_A, s_B + r_B)$.
Multiplication with a public value $t \in \mathbb{Z}_{2^\ell}$ is also simple as
$\langle ts \rangle = t \langle s \rangle = (t s_A, t s_B)$. Multiplying secret shares
$\langle x \rangle$ and $\langle y \rangle$, however, requires further techniques such as Beaver
triples. Beaver triples are shares $\langle a \rangle, \langle b \rangle, \langle c \rangle$ such
that $c = a \cdot b$ [Bea91]. With such a triple, the parties compute $\langle x - a \rangle$ and
$\langle y - b \rangle$, and reconstruct $\langle x - a \rangle$ as $\alpha = x - a$ and
$\langle y - b \rangle$ as $\beta = y - b$. Since $xy = (\alpha + a)(\beta + b)$, they can express
the multiplication of shares as
$\langle xy \rangle = \langle c \rangle + \alpha \langle b \rangle + \beta \langle a \rangle + \alpha \cdot \beta$,
where the public term $\alpha \cdot \beta$ is added by only one of the parties. Such Beaver
triples can be constructed via oblivious transfer [DSZ15], which we describe next.
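
The operations on additive shares can be simulated in plain Python as a sanity check. The sketch below runs both parties in one process, uses an assumed bit length of 32, and lets a trusted dealer create the Beaver triple; real protocols generate triples without a dealer (e.g., via oblivious transfer), so this illustrates only the arithmetic and is not a secure implementation.

```python
import secrets

BITS = 32          # assumed bit length for this example
MOD = 1 << BITS    # all arithmetic is performed modulo 2^BITS


def share(secret):
    """Split a secret into (s_A, s_B) with s_A + s_B = secret (mod 2^BITS)."""
    u = secrets.randbelow(MOD)
    return ((secret - u) % MOD, u)


def reconstruct(sharing):
    return (sharing[0] + sharing[1]) % MOD


def add(s, r):
    """Each party adds its own shares locally."""
    return ((s[0] + r[0]) % MOD, (s[1] + r[1]) % MOD)


def mul_public(t, s):
    """Each party multiplies its share with the public value t."""
    return ((t * s[0]) % MOD, (t * s[1]) % MOD)


def beaver_triple():
    """Dealer simulation: sharings of random a, b and of c = a * b."""
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    return share(a), share(b), share((a * b) % MOD)


def multiply(x, y, triple):
    a, b, c = triple
    # The parties open alpha = x - a and beta = y - b; both values are masked by a, b.
    alpha = reconstruct(((x[0] - a[0]) % MOD, (x[1] - a[1]) % MOD))
    beta = reconstruct(((y[0] - b[0]) % MOD, (y[1] - b[1]) % MOD))
    # Local computation per party; only party A adds the public term alpha * beta.
    z_a = (c[0] + alpha * b[0] + beta * a[0] + alpha * beta) % MOD
    z_b = (c[1] + alpha * b[1] + beta * a[1]) % MOD
    return (z_a, z_b)


product = reconstruct(multiply(share(6), share(7), beaver_triple()))
```

A triple must not be reused for a second multiplication, since the opened values $\alpha$ and $\beta$ would otherwise leak information about the inputs.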


Oblivious Transfer 


Oblivious Transfer (OT) is a cryptographic primitive that is equivalent to secure computation
[Kil88]. In other words, OT is sufficient to construct any kind of secure computation, making
it a very powerful tool. Specifically, the functionality for 1-out-of-2 OT considers two parties:
a sender with two secrets $s_1, s_2$ and a receiver. The receiver wants to learn one of these
secrets but in such a way that the sender does not discover which one. A simple OT construction,
based on the Diffie-Hellman key exchange [DH76], is given by Chou and Orlandi in [CO15]. While OT
requires costly computations (asymmetric key cryptography and therefore, e.g., modular
exponentiations), there are efficient constructions, so-called OT extensions [Bea96][IKNP03], which
use a few (base) OTs to build larger, logical OTs.


Garbled Circuits 


A Boolean circuit consists of logical gates connected by wires which move the output from one 
gate to the input of another gate. The functionality of a gate is described by the so-called truth 
tables which map inputs to their outputs. Garbled circuits are Boolean circuits, where each gate’s 
truth table is “garbled”. Given two parties A and B, the four possible inputs of a garbled truth table 
are not bits, but random labels, e.g., encryption keys. Thus, one can envision a garbled truth table 
as a double-encrypted table, where one key corresponds to the input bit from party A and the other 
one from party B. 
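
The idea of a garbled truth table can be illustrated with a toy garbled AND gate. The sketch below is a deliberate simplification under several assumptions: the "double encryption" is realized as a one-time pad derived from a SHA-256 hash of both input labels, a block of zero bytes lets the evaluator recognize the correct row, and the transfer of labels (which requires oblivious transfer in practice) is omitted. It is not the construction used by ABY or DPsc.

```python
import hashlib
import secrets

LABEL_BYTES = 16


def _pad(label_a, label_b):
    """Derive a 32-byte pad from the two input labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(left, right):
    return bytes(p ^ q for p, q in zip(left, right))


def garble_and_gate():
    """Garbler: create two random labels per wire and the garbled table."""
    wire_a = [secrets.token_bytes(LABEL_BYTES) for _ in range(2)]
    wire_b = [secrets.token_bytes(LABEL_BYTES) for _ in range(2)]
    wire_out = [secrets.token_bytes(LABEL_BYTES) for _ in range(2)]
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            # Output label followed by zero bytes that mark a valid decryption.
            plaintext = wire_out[bit_a & bit_b] + bytes(LABEL_BYTES)
            table.append(_xor(_pad(wire_a[bit_a], wire_b[bit_b]), plaintext))
    secrets.SystemRandom().shuffle(table)  # hide which row belongs to which inputs
    return wire_a, wire_b, wire_out, table


def evaluate_gate(table, label_a, label_b):
    """Evaluator: holds one label per input wire and decrypts exactly one row."""
    pad = _pad(label_a, label_b)
    for row in table:
        candidate = _xor(pad, row)
        if candidate[LABEL_BYTES:] == bytes(LABEL_BYTES):
            return candidate[:LABEL_BYTES]
    raise ValueError("no row could be decrypted with the given labels")


wire_a, wire_b, wire_out, table = garble_and_gate()
out_label = evaluate_gate(table, wire_a[1], wire_b[1])  # equals wire_out[1]
```

The evaluator obtains a random-looking output label and learns the plaintext bit only if the garbler provides the corresponding output translation.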

Next, we informally describe garbled circuits, for which a formalized description due to Bellare
et al. is provided in Deliverable D5.4 “Final versions of tools for data sanitisation and
computation”. One party acts as the garbler and garbles the gates by creating the random labels.
In other words, this party creates the garbled circuit. The other party acts as the evaluator and
evaluates the garbled circuit. Note that the evaluator must not learn the labels for both possible
inputs 0, 1 per wire, as otherwise the evaluator could evaluate more than only its own input and
learn more than only the agreed-upon function output. Also, the garbler must not learn the
evaluator’s input per wire, as this directly reveals the evaluator’s sensitive input. To solve this
problem, oblivious transfer is used. Thus, the garbler sends the evaluator only one of two possible
labels $l_0, l_1$ (e.g., $l_0$ if the evaluator’s input for the current gate is 0) and the garbler
does not learn which one. Afterwards, the evaluator knows the garbled circuit with the garbled
labels (which look random to the evaluator) and its own input labels and can evaluate (i.e.,
decrypt) each output label (i.e., next key) and use it to evaluate the next gate, etc. The garbler
also produces an output translation table, which maps the final random output label to its
corresponding plaintext result.


Converting between secret sharing and garbling


To use both schemes (secret sharing and garbled circuits) in one protocol, conversions between
their value representations (secret shares and garbled values) are required.

To convert an additive secret share to a garbled value, one executes a garbled addition circuit
where the secret shares are the inputs. The other direction, garbling to secret sharing, requires
a garbled subtraction circuit $C$. Let $x$ be the value to convert. Then, the garbler, e.g., party
$A$, samples a random value $r$, and circuit $C$ receives inputs $[x], [r]$ and outputs $[x - r]$
(using modulo) to the other party, who is allowed to decode this garbling and learn $x - r$. Now,
$A$ holds $r$ and $B$ holds $x - r$, which is a secret sharing of $x$. For further details and more
efficient conversions, we refer to [DSZ15].


3.3 Solutions 


In this section, we present the different tools to enable UC3. 

In Section 3.3.1, we detail DPtool, which provides a REST API as well as a GUI for differential
privacy mechanisms to support the sanitization of sensitive data. DPtool covers the local model
of differential privacy, i.e., sanitization is applied during ingestion before storage (option 1
from Section 3.1). Also, DPtool supports the central model, i.e., sanitization is applied on
centrally stored data for selected analysis functions (option 2 from Section 3.1).

In Section 3.3.2, we describe MIA, which helps data scientists to parameterize differential privacy
in the context of machine learning and provides a membership inference analysis. MIA covers the
central model; however, it can also cover the local model.

In Section 3.3.3, we present DPsc, which securely computes the differentially private median.
DPsc covers the hybrid model (option 3 from Section 3.1), i.e., it closes the gap between the
local and central model.


3.3.1 DPtool: Toolbox for Differential Privacy Mechanisms 


DPtool applies various anonymization methods to an input data set stored as comma-separated
values (CSV file) and produces an anonymized output data set.


REST API 


DPtool exposes its services via a REST (REpresentational State Transfer) API. The REST
service is implemented with Spring, a Java-based framework for the development of web
applications that supports the implementation of RESTful services [Spr0Z]. The anonymization
methods are implemented in Java as well.

The paths for the input and output file as well as the parameters for the anonymization methods are
provided as parameters to the REST API in JSON format. For each anonymization method, the name of
the column to which the anonymization will be applied is required, as well as the sensitivity
and privacy parameter $\epsilon$. The Gaussian mechanism requires an additional parameter $\delta$,
which is typically chosen to be negligible in the data size [DR14]. The sensitivity is the largest
impact any individual (added/removed data point) can have on a function evaluation (see
Definitions 3 and 5).
The main anonymization methods exposed by the REST API are listed in Table 3.1, where the
input/output paths are omitted for readability.

| Method | Parameters (JSON) |
| --- | --- |
| Laplace mechanism adds noise from a Laplace distribution; parameterized with sensitivity $\Delta f$ and privacy parameter $\epsilon$ as in Definition 3. | { "column": <string>, "epsilon": <float>, "sensitivity": <float> } |
| Geometric mechanism adds noise from a symmetric geometric distribution to integer counts; parameterized with privacy parameter $\epsilon$ as in Definition 4. | { "column": <string>, "epsilon": <float> } |
| Gauss mechanism adds noise from a Gauss distribution; parameterized with sensitivity $\Delta_2 f$ and privacy parameters $\epsilon, \delta$ as in Definition 5. | { "column": <string>, "epsilon": <float>, "sensitivity": <float>, "delta": <float> } |
| Exponential mechanism probabilistically selects an output based on its utility score; parameterized with privacy parameter $\epsilon$ as in Definition 6. Utility scoring is provided for two applications: i) median (based on ranks, see also Section 3.3.3) and ii) similar words (based on Levenshtein distance). | { "column": <string>, "epsilon": <float> } |
| Geo-Indistinguishability mechanism randomizes a location, expressed as latitude/longitude, via privacy parameter $\epsilon$ to satisfy Definition 7. | { "latitudeColumn": <string>, "longitudeColumn": <string>, "epsilon": <float> } |

Table 3.1: DPtool: Main anonymization methods
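
As an illustration of the parameter objects in Table 3.1, the following sketch assembles and validates a JSON body for the Gauss mechanism. The keys "column", "epsilon", "sensitivity", and "delta" are taken from the table; the keys "inputPath" and "outputPath", the helper names, and the file names are hypothetical, since the table omits the path parameters and this section does not specify an endpoint.

```python
import json


def gauss_parameters(column, epsilon, sensitivity, delta):
    """Build the parameter object for the Gauss mechanism (keys as in Table 3.1)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return {
        "column": column,
        "epsilon": float(epsilon),
        "sensitivity": float(sensitivity),
        "delta": float(delta),
    }


def build_request_body(parameters, input_path, output_path):
    """Add the (hypothetically named) path keys and serialize to JSON."""
    body = dict(parameters)
    body["inputPath"] = input_path    # hypothetical key name
    body["outputPath"] = output_path  # hypothetical key name
    return json.dumps(body, indent=2)


body = build_request_body(
    gauss_parameters("salary", epsilon=0.5, sensitivity=1000.0, delta=1e-6),
    input_path="input.csv",
    output_path="output.csv",
)
```

Validating the parameters on the client side mirrors the checks that the GUI performs before a method is applied.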


Further methods are provided based on these main mechanisms, as detailed in the following.


DP average uses the Laplace mechanism to compute a differentially private average with
bounded results (parameters min, max) as described in Li et al. [LLSY16, Algorithm 2.3].


DP sum calls the Laplace mechanism without a sensitivity parameter and automatically detects
the required sensitivity for the sum function based on the provided data (assuming that the
data contains $\min U$ and $\max U$).


Latitude/Longitude applies the geo-indistinguishability mechanism where only one of the
coordinates (either latitude or longitude) is provided.


IP-geo randomizer maps IP addresses to coordinates, based on mapping data that can be specified
in the application settings, before calling the geo-indistinguishability mechanism.
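
The DP sum method described above can be sketched as follows. The sensitivity rule used here, the largest absolute value among the data's minimum and maximum, is an assumption of this example for neighboring data sets that differ by one added or removed element; the text only states that the sensitivity is detected from the data. As in the text, the detection is only meaningful if the data contains the smallest and largest domain element.

```python
import random


def dp_sum(values, epsilon, rng=random):
    """Return (noisy sum, detected sensitivity) using Laplace noise."""
    if not values or epsilon <= 0:
        raise ValueError("values must be non-empty and epsilon positive")
    # Assumed rule: one added or removed element changes the sum by at most
    # the largest absolute value in the domain, read off the data here.
    sensitivity = max(abs(min(values)), abs(max(values)))
    if sensitivity == 0:
        return float(sum(values)), 0  # all values are zero, no noise scale exists
    scale = sensitivity / epsilon
    # Laplace(scale) as the difference of two exponential variables.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(values) + noise, sensitivity


noisy_total, detected_sensitivity = dp_sum([0, 3, 10, 7], epsilon=1.0)
```

If the domain bounds are not present in the data, the detected sensitivity is too small and the differential privacy guarantee no longer holds.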
DPtool also supports some anonymization techniques not based on perturbation and differential
privacy:


Removal sets all values in the specified column to the empty string; thus, the structure of the
data (e.g., columns) remains to avoid parsing errors in case of post-processing.


Removal by delimiter removes values before/after (defined via parameter area) pre-defined
delimiters, i.e., “.” (IPv4), “:” (MAC/IPv6), “@” (email).


GUI 


SAP User Interface for HTML5 (SAPUI5) is the basis for the graphical user interface. SAPUI5
is a framework to develop responsive web applications supporting HTML5 standards [SAP20],
which provides convenient JavaScript libraries and is suited for model-view-controller (MVC)
architectures. The model component details the data-related logic and is expressed in JavaScript.
The view component describes the user interface with HTML/XML. The controller component
connects the model and view components and processes incoming requests, e.g., actions taken via
the view (do apply anonymization) will update the model (applied anonymization). Configurations,
e.g., button labels (for localization), are stored in i18n property files.

Figures 3.2 to 3.4 visualize the steps in DPtool during the anonymization process.


[Screenshot: “Ingest file” button above a table with the columns “Column” and “Anonymization
method”, showing “No data”.]

Figure 3.2: DPtool: Step 1 (data selection and ingestion)


Initially, a client opens a browser and opens the URL corresponding to the running DPtool
instance. First, the data is selected by browsing for it via a file dialog opened by pressing
“Browse” and ingested by pressing “Ingest file”. This initial selection step is shown in
Figure 3.2. Then, the anonymization methods per named column can be selected via a drop-down menu
as in Figure 3.3. After selecting a desired method, its required parameters and input fields
appear as in Figure 3.4, such that the methods can be parameterized accordingly. Finally, by
pressing “Anonymize”, the methods are applied to the data and the data can be downloaded with
a report detailing the selected anonymization mechanisms and parameterization.

Furthermore, during the parameterization step (Figure 3.4), the tool validates the parameters,
provides tooltips for the mechanism and parameter selection, and highlights incorrect parameters
with a red border, as seen in Figure 3.5.

[Screenshot: drop-down menu in the “Anonymization method” column listing methods such as
“Removal” and “Gauss”, with the “Ingest file” and “Anonymize” buttons.]

Figure 3.3: DPtool: Step 2 (method selection)

[Screenshot: ingested file synth.csv with the input fields “Column”, “Epsilon”, “Sensitivity”,
and “Delta” for the selected anonymization method, and the “Anonymize” button.]

Figure 3.4: DPtool: Step 3 (method parameterization)

Figure 3.5: DPtool: Parameter validation and tooltip


3.3.2 MIA: Analysis for Membership Inference Attack


The MIA tool is a service to simulate membership inference attacks (MIA) on machine learning
models. It can be used to assure ML quality with regard to the accuracy of the anonymized models
and to quantify privacy via the success probability of membership inference attacks.


Architecture 


Theoretical details for membership inference are given in Deliverable D5.2 “First Report on
Privacy Metrics and Data Sanitisation”, and the architecture of the MIA tool is detailed in
Deliverable D5.1 “First Version of Data Sanitisation Tools”. We briefly recall the architecture
here. MIA is mainly realized in Python, and its GUI is written in SAPUI5, like that of DPtool. The
MIA REST service runs in a Tornado HTTP server. The configurations as well as the ML models and
their parameters are stored in a MongoDB database. Data sets uploaded to MIA are stored as files.


Usage 


Figures 3.6 to 3.8 are screenshots of the step-by-step guide to initialize a report with MIA.
Initially, a client opens a browser and opens the URL corresponding to the running MIA instance.

Step 1 (Figure 3.6) shows the selection of a named machine learning model for which an
analysis is to be created. Step 2 (Figure 3.7) shows the selection of the training data for the
previously selected model. Step 3 (Figure 3.8) shows the parameterization of the model. This
simple setup suffices to start the training process. Afterwards, the membership inference risk
and the utility with and without differential privacy can be compared.


[Screenshot: wizard “New MIA Privacy Report”, step “1. Select Model”, with the list “Available
Models” (column “Model Name”) containing the entry “Purchases Model” and a button leading to
step 2.]

Figure 3.6: MIA: Step 1 (model selection)

[Screenshot: wizard “New MIA Privacy Report”, step “2. Select Dataset”, with the table “Available
Datasets” (columns “Dataset Name”, “Classes”, “Samples”); the listed data set has 100000 training
and 100000 test samples.]

Figure 3.7: MIA: Step 2 (data selection)

[Screenshot: wizard “New MIA Privacy Report”, step “3. Configure”, with the input fields “Report
name”, “Epochs”, “Learning rate”, “Batch size”, “Number of shadow models”, and “Noise Multiplier”
as well as the buttons “Create Report” and “Cancel”.]

Figure 3.8: MIA: Step 3 (parameterization)


3.3.3 DPsc: Secure Computation of Differentially Private Statistics


The theoretical insights that led to DPsc are detailed in Deliverable D5.4 “Final versions of tools
for data sanitisation and computation”, based on a published paper [BK20b]. An extension for
the multi-party setting is detailed in Deliverable D5.5 “Report on data sanitisation and
computation” and was also published as a paper [BK20a]. In this section, we focus on the tool
itself, its underlying technology, and its usage.


Rank-based Statistics 


DPsc is a protocol to securely compute differentially private rank-based statistics over
distributed data held by two parties. Order statistics, also called rank-based statistics, are
defined over a sorted data set $D$. The rank is the first, zero-based position of an element in a
sorted data set. Figure 3.9 visualizes the ranks for a sorted data set and includes examples for
rank-based statistics, namely, the minimum, median, and maximum of $D$. While DPsc supports any
rank-based statistic by, e.g., padding the data accordingly [AMP10], we focus on the median, which
is the element that roughly splits the sorted data in half.

[Figure: sorted data set $D$ whose unique elements have the ranks 0, 1, 3, 4, 5, 6, 7, 8; the
labels min, median, and max mark the corresponding elements.]

Figure 3.9: Sorted data set D with ranks per unique datum


Overview 


Next, we provide an overview of DPsc, which is described in detail in Deliverable D5.4 “Final
versions of tools for data sanitisation and computation”.

Firstly, if the data size $n$ is not sublinear in the size of the domain $|U|$, i.e.,
$n > \log_2 |U|$, then DPsc first prunes the data according to Aggarwal et al. [AMP10]: The parties
$A$ and $B$ compute their local median values $m_A$ and $m_B$, respectively, over their own data
and securely compare them via garbled circuits to learn the bit $c = (m_A < m_B)$. If the resulting
bit $c$ is 1, i.e., $m_A$ is smaller than $m_B$, then the joint median lies between $m_A$ and
$m_B$, so $A$ discards the lower half of its sorted data and $B$ discards its upper half. If $c$ is
0, they do the opposite. This is repeated until the data is small enough, i.e., sublinear in $|U|$.

Secondly, the parties securely merge their pre-sorted (potentially pruned) data via garbled circuits
and learn a secret shared version of the merged, sorted data.

Thirdly, the parties compute the selection probabilities over the secret shared data. This can be
done via multiplication with known values and subtraction of secret shared values.

Finally, the parties sample the DP median via inverse transform sampling over the previously
computed selection probabilities.
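
The pruning step of the overview can be simulated in plaintext to see why it preserves the median. The sketch below is only an illustration under stated assumptions: both parties hold sorted lists of the same power-of-two length, the comparison bit is computed in the clear (in DPsc it results from a garbled-circuit comparison), and the function names are hypothetical. In each round, the party with the smaller local median keeps its upper half and the other party keeps its lower half.

```python
def prune(sorted_a, sorted_b, target_size):
    """Halve both sorted lists until each holds at most target_size elements."""
    n = len(sorted_a)
    if n != len(sorted_b) or n == 0 or n & (n - 1):
        raise ValueError("both lists need the same power-of-two length")
    a, b = list(sorted_a), list(sorted_b)
    while len(a) > max(target_size, 1):
        half = len(a) // 2
        m_a, m_b = a[half - 1], b[half - 1]  # local (lower) medians
        c = m_a < m_b                        # secure comparison in the real protocol
        if c:
            a, b = a[half:], b[:half]        # A keeps upper half, B keeps lower half
        else:
            a, b = a[:half], b[half:]        # the opposite
    return a, b


def lower_median(values):
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


data_a = list(range(1, 9))    # party A: 1..8
data_b = list(range(9, 17))   # party B: 9..16
pruned_a, pruned_b = prune(data_a, data_b, target_size=2)
```

Each round removes the same number of elements below and above the joint median, so the median of the pruned data equals the median of the original joint data.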


Architecture 


DPsc is implemented with the mixed-protocol secure computation framework ABY,
which is actively maintained. ABY provides basic secure protocols for, e.g., addition and
comparison. It supports garbled circuits as well as secret sharing and provides efficient
conversions between these schemes.

ABY programs are written in C++ and define a circuit circ, i.e., all operations are expressed as
logical gates. For example, function circ->PutGTGate(inputa, inputb) adds a comparison
gate to the circuit, which evaluates to 1 if inputa is greater than inputb.

For development purposes, ABY can be set up as described on ABY’s GitHub page². Briefly, on
a Linux system one installs the requirements³, clones the repository, and builds ABY, including
example applications, as follows:


cd ABY/ && mkdir build && cd build
cmake .. -DABY_BUILD_EXE=On
make


The source structure of an ABY program, like DPsc, is as follows. The program folder contains
• file CMakeLists.txt to build the project with cmake
• file dp_sc.cpp, which contains the main function and the logic to parse the parameters
• folder common containing
  - file dp_sc.cpp containing the main logic
  - file dp_sc.h containing the interface definition

To create an executable from a program, one calls cmake . && make in the program folder.
After creating an executable, the executable can be transferred to the participating parties. Then,
the parties have to execute it as explained for DPsc next.

Usage 


DPsc is a command-line tool which acts as server or client depending on how it is invoked. To
jointly execute DPsc, one party invokes it as a server and the other one as a client. The only
result is the differentially private median over their joint data set; nothing else is revealed
about the data. DPsc has the following mandatory parameters:


-r [Role: 0/1 for server/client, required]
-f [Data set file (one value per line), required]
-n [Number of elements, required]
-b [Bit-length, required]


DPsc supports the following optional parameters:


-e [Epsilon, default: ln(2), optional]
-i [IP-address, default: localhost, optional]
-p [Port, default: 7766, optional]
-c [Convert to arithmetic sharing for ADD/MUL, optional]
-m [Min universe element (default: min of input), optional]
-M [Max universe element (default: max of input), optional]
-N [Number of nonces (per party) for unbiased sampling, optional]
-B [Use biased sampling (uses only one nonce, i.e., -N 1), optional]
-s [Symmetric Security Bits, default: 128, optional]
-a [Accuracy, default: 0.90, optional]
-o [Print debug output, optional]


² https://github.com/encryptogroup/ABY

³ g++, make, cmake, libgmp-dev, libssl-dev, libboost-all-dev.


As an example, consider the first party executing

./dp_sc -r 0 -f data_1.txt -n 10000 -e 0.1 -b 64 -N 30

and the second party executing

./dp_sc -r 1 -f data_2.txt -n 10000 -e 0.1 -b 64 -N 30


The first starts DPsc as a server, reads the first 10,000 values from data set data_1.txt, sets
$\epsilon$ to 0.1, defines bit-length 64, and uses 30 input nonces for rejection sampling. The
second does the same, however with the role of client and with another data set. In a WAN setting,
the IP address of the server must be specified as well with parameter -i and, potentially, the port
with parameter -p. Next, we further detail some parameters.

Parameter -c adds conversions: While the main part of our protocol is written as a garbled circuit,
we support converting arithmetic operations to secret sharing. In more detail, addition of secret
values and multiplication with known values are costly in Boolean circuits (i.e., garbled circuits)
but “free” in arithmetic circuits (i.e., secret sharing) in the sense that no interaction is
required between the parties. The conversion itself requires some interaction between the parties.
However, in our evaluation (detailed in Deliverable D5.4 “Final versions of tools for data
sanitisation and computation”) a protocol execution with conversion is much faster in a wide-area
network than one without conversion, as the conversion overhead is much smaller than the overhead
for, e.g., multiple additions in Boolean circuits.

Note that the parties can input the minimum (-m) and maximum domain element (-M) if these
are not contained in the data set. The minimum and maximum domain elements are required to
satisfy differential privacy, as all neighboring data sets must be considered, which can contain
any element from the entire data domain. Recall that a neighboring data set $D'$ of data set $D$
is just $D$ with an element removed or added. We must compute non-zero selection probabilities for
all possible output elements, i.e., domain elements in the case of the median, to satisfy
differential privacy.
Parameter -B selects biased sampling; by default, (unbiased) rejection sampling is used. Both
methods refer to uniform sampling, i.e., outputting a random element from a fixed range where all
elements are equally likely to be selected. Biased sampling is simpler but slightly favors smaller
values, i.e., it is not perfectly uniformly random. These different sampling methods were
implemented to evaluate if the former is more efficient than the latter. However, in our evaluation
(detailed in Deliverable D5.4 “Final versions of tools for data sanitisation and computation”)
rejection sampling adds only negligible overhead while being unbiased. Biased sampling of integer
range $[0, R)$ can be implemented for two semi-honest parties as follows: First, each party selects
a random number, a so-called nonce; then, the parties compute $x$ by XORing their nonces, and
output $x \bmod R$. However, $R$ is secret as it is the normalization term (i.e., denominator) in
Definition 6. Thus, $R$ is computed on the sensitive data, which makes it sensitive as well, as it
can leak information about the data⁴. The parties cannot learn $R$; however, an upper bound $U$ can
be derived by assuming the largest possible utility scores for each element. Thus, the parties
input nonces from $[0, U]$. This sampling method is biased, as $R$ does not necessarily evenly
divide the range of possible values for $x$, i.e., the modulo operation slightly favors smaller
values as outputs. Rejection sampling, on the other hand, is unbiased. Here, we use multiple nonces
(parameter -N). For each pair of nonces (one per party), we combine them via XOR to $x$ as before
and compute $r$ as the AND of $x$ with a bit-mask $b$. The bits in $b$ are zero until the first
position where $R$ has a one, and consist of only ones afterward (e.g., $b = 00111_2$ for
$R = 00101_2$). Now, we reject $r$ if it is larger than $R$ and otherwise stop and output $r$. All
nonces might be rejected; however, the probability is negligible in the number of nonces (i.e.,
$2^{-N}$ for $N$ nonces). We use a fixed number of nonces, as secure computation does not allow
conditional loops, e.g., executing a loop until a condition is met, when the condition is secret
and the number of iteration steps might reveal something about the condition. For more details
about the implemented sampling methods, we refer to Deliverable D5.4 “Final versions of tools for
data sanitisation and computation”.


⁴ For example, the normalization term can differ between neighboring data sets, which suffices to
differentiate them and violate differential privacy.
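
The two sampling methods can be compared with a plaintext sketch. It assumes the target range $[0, R)$ from the description of biased sampling, so candidates $r \geq R$ are rejected; it also computes on values in the clear and stops at the first accepted candidate, whereas a secure computation has to process all $N$ candidates and select the result obliviously. The function names are hypothetical.

```python
def biased_sample(nonce_a, nonce_b, upper):
    """XOR both nonces and reduce modulo `upper` (slightly favors small values)."""
    return (nonce_a ^ nonce_b) % upper


def bit_mask(upper):
    """All ones from the most significant set bit of `upper` downward."""
    return (1 << upper.bit_length()) - 1


def rejection_sample(nonces_a, nonces_b, upper):
    """Return the first masked candidate below `upper`, or None if all are rejected."""
    mask = bit_mask(upper)
    for nonce_a, nonce_b in zip(nonces_a, nonces_b):
        candidate = (nonce_a ^ nonce_b) & mask
        if candidate < upper:
            return candidate
    return None


# Example with R = 00101 (binary): the mask is 00111.
mask_example = bit_mask(0b00101)
result = rejection_sample([0b00110, 0b00011], [0b00000, 0b00000], 0b00101)
```

Because the mask never spans more than twice the target range, each candidate is accepted with probability of at least 1/2 for uniformly random nonces, which explains the failure probability of at most $2^{-N}$.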

Parameter -s corresponds to the security parameter $\lambda$ in Definition 2.

Parameter -a defines the selection accuracy, i.e., the probability to select an output from the
remaining elements instead of the pruned ones (see Deliverable D5.4 “Final versions of tools for
data sanitisation and computation” for details).

Parameter -o provides debug output during the computation. This parameter should only be used 
for demonstration purposes as it reveals all secret values and nonces! 


3.4 Summary 


UC3 considers a cloud-based data market, where different parties want to share insights over their
data sets without revealing their data to each other. To allow such privacy-preserving analytics in
a cloud-based environment, we presented different tools, namely DPtool, MIA, and DPsc, which
enable the use case and cover the entire life-cycle of MOSAICrOWN, from ingestion and storage
to analytics.

Next, we briefly summarize the different tools. 


DPtool provides simple interfaces to apply differential privacy mechanisms on data sets to
sanitize them during ingestion (i.e., only sanitized data is stored) or during analytics (i.e., data
is stored in plaintext or encrypted). DPtool provides a programmatic interface in the form
of a REST API allowing automatic sanitization. Furthermore, DPtool provides a graphical
interface in the form of SAPUI5 which permits manual sanitization, e.g., to support analysts
in evaluating different techniques and parameters with immediate feedback by displaying a
subset of the sanitized data.


MIA supports data scientists in the parameterization of differential privacy in the context of
machine learning, e.g., during ingestion or investigation of analytical functions of interest. MIA
provides a graphical user interface based on SAPUI5 and assists data scientists and analysts
in evaluating the protection guarantees of differential privacy via membership inference
risks as well as the utility-privacy trade-offs inherent in any anonymization technique.


DPsc enables differentially private rank-based statistics over two distributed data sets (as
envisioned in the use case) without sharing the data at all. In other words, the ingestion and
storage phases are skipped and we go right to the analytical phase. DPsc combines secure
computation and differential privacy, i.e., the parties collaboratively compute a DP mechanism
(in this case the exponential mechanism) while only learning the mechanism’s output
but none of the inputs from the other party. DPsc is a command-line tool with few required
parameters, e.g., input data and privacy parameter, and multiple optional parameters (with
sane defaults).
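
To make the mechanism behind DPsc more concrete, the following is a minimal plaintext sketch of
the exponential mechanism for a rank-based statistic such as the median. It is an illustration under
stated assumptions and not the DPsc protocol: the two data sets are simply merged in the clear
(which DPsc avoids via secure computation), the utility function and the sensitivity of 1 are
assumptions chosen for this example, no pruning is performed, and Python's non-cryptographic random
module is used. The names rank_utility and exponential_mechanism are hypothetical.

```python
import bisect
import math
import random


def rank_utility(sorted_data, candidate, percentile=0.5):
    """Assumed utility: negative distance between the candidate's rank
    (number of elements below it) and the target rank."""
    below = bisect.bisect_left(sorted_data, candidate)
    return -abs(below - percentile * len(sorted_data))


def exponential_mechanism(data, candidates, epsilon, percentile=0.5,
                          sensitivity=1.0, rng=None):
    """Select a candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    rng = rng or random.Random()
    sorted_data = sorted(data)
    scores = [rank_utility(sorted_data, c, percentile) for c in candidates]
    best = max(scores)  # shift scores for numerical stability
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Two parties' data, merged in the clear only for this illustration.
data_a = [1, 3, 5, 7, 9]
data_b = [2, 4, 6, 8, 10]
candidates = list(range(0, 12))
noisy_median = exponential_mechanism(
    data_a + data_b, candidates, epsilon=1.0, rng=random.Random(0)
)
print(noisy_median)
```

A larger privacy parameter concentrates the selection on candidates whose rank is close to the
target rank, while a smaller one spreads the probability over more candidates.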




4. Conclusions 


This deliverable reported on the prototypes developed during MOSAICrOWN to satisfy the use 
cases by the industry partners (EISI, MC, SAP SE). The prototypes leverage technologies devel- 
oped in Work Packages 3-5. 

Chapter 1 presented tools for Use Case 1. This use case, by EISI, considers data protection in the
context of platforms for intelligent connected vehicles where, e.g., electric vehicles and charging
infrastructures can securely collect and exchange insights to improve planning and charger
allocations. To realize this use case, different tools have been designed and developed. First, an
automotive tool facilitates data collection and ingestion from the electric vehicle to the data
market. Second, a web tool enables authorized data access via a user interface. Third, a policy
engine realizes access control, i.e., permitting or denying access, after parsing the corresponding
MOSAICrOWN policy. Fourth, an encrypted filesystem provides confidentiality and enables
secure storage of the personal data.

Chapter 2 detailed tools for Use Case 2. This use case, by MC, focuses on financial institutions
and their transaction-level data anonymization. The proof-of-concept tools provide a platform 
for different wrapping as well as data sanitization techniques to augment policy-based protection 
mechanisms which control data access, usage, and sharing. Moreover, the tools are flexible when 
considering different privacy regulations (e.g., GDPR and LGPD). The tools automatically identify 
the semantics of each field (in terms of type, data distribution, and content) and suggest for each 
field the wrapping techniques related to the selected privacy regulation. Also, visualization of data 
analytics and summary results are provided by the tools. 

Chapter 3 described tools for Use Case 3. In this use case, by SAP SE, the goal is to enable
privacy-preserving analytics for cloud-based data markets. In more detail, businesses want to share
insights over their business-sensitive operational data and personal customer experience data,
without revealing the underlying sensitive data. To realize this use case, three tools were designed
to cover different aspects of privacy-preserving analytics: DPtool, MIA, and DPsc. DPtool provides
various mechanisms for anonymization with differential privacy, such as additive noise sampled
from the Laplace distribution (for reals) or the Geometric distribution (for integers). MIA aims to
help data scientists during the parameterization of differential privacy mechanisms in the context
of machine learning and provides a risk-based evaluation of the applied parameterization. DPsc
combines differential privacy with cryptographic techniques to enable distributed anonymization
on data of multiple parties, releasing only the anonymized results for rank-based statistics, which
include percentiles such as the median (the 50-th percentile).
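
The additive-noise mechanisms named above for DPtool can be sketched as follows. This is a minimal
illustration, not DPtool's API: the function names laplace_mechanism and geometric_mechanism are
hypothetical, the noise scale follows the standard textbook parameterization (sensitivity divided
by the privacy parameter), and the samplers rely on Python's non-cryptographic random module, so the
sketch is not hardened for production use.

```python
import math
import random


def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to a real value.

    The difference of two i.i.d. exponential variables is Laplace distributed.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


def _geometric(alpha, rng):
    """Sample k >= 0 with probability (1 - alpha) * alpha**k."""
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return math.floor(math.log(u) / math.log(alpha))


def geometric_mechanism(value, sensitivity, epsilon, rng):
    """Add two-sided geometric noise to an integer value.

    The difference of two i.i.d. geometric variables has probability
    proportional to alpha**|z| with alpha = exp(-epsilon / sensitivity).
    """
    alpha = math.exp(-epsilon / sensitivity)
    return value + _geometric(alpha, rng) - _geometric(alpha, rng)


rng = random.Random(42)
print(laplace_mechanism(10.0, sensitivity=1.0, epsilon=0.5, rng=rng))
print(geometric_mechanism(10, sensitivity=1, epsilon=0.5, rng=rng))
```

The geometric variant keeps integer inputs integer-valued, which matches the distinction drawn in
the text between noise for reals and noise for integers.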